Am29300 Family Overview


Advanced Micro Devices has developed a new VLSI family to 
support very high performance applications in general purpose 
computation, intelligent peripheral controllers and array and 
digital signal processing—the Am29300 family. 


The family features high performance, greatly increased 
functionality relative to earlier approaches, and a high degree 
of architectural flexibility. 


32-BIT VLSI 


Historically, the Am2901 made a radical departure from con- 
ventional MSI functions by integrating several elements of a 
CPU into a vertical 4-bit slice. The combination of memory and 
ALU logic in a single package offered the user added 
functionality, reduced package count and data-path width 
flexibility. 

The Am29300 family reverses the trend of vertical slice 
partitioning by integrating complete 32-bit functions into single 
VLSI devices.

There are several reasons for the choice of a wider data path.
First, cycle time is improved significantly if carry lookahead is
contained entirely on the chip. Second, certain powerful
on-chip functions, such as the funnel shifter, priority encoder 
and mask generator are extremely difficult to expand when 
using vertical slices. Third, a higher level of integration leads to 
a more cost-effective system solution. The wider data path 
also affords greater I/O bandwidth, higher precision and in- 
creased memory addressability. These and other advantages 
contributed to the decision to make a family of complete 32-bit 
functions rather than slices. 


The Am29300 family currently consists of five members: 


Am29332  32-Bit ALU
Am29331  16-Bit Microinterruptible Sequencer
Am29334  64x18 Dual-Access Four-Port Register File
Am29325  32-Bit Floating Point Processor
Am29323  32x32 Parallel Multiplier


Figure 1. Am29300 Family High Performance System Block Diagram





FUNCTIONAL PARTITIONING 
MORE EFFICIENT 


The Am29300 family departs from vertically partitioned bit-slice 
functions because it is divided into larger, horizontally parti- 
tioned building blocks. The ALU no longer contains a register 
file. Instead, there is a more flexible stand-alone register file, 
the Am29334, making expansion and regular addressing much 
easier. 

The new partitioning results in a number of benefits. The user
gets a powerful processor with two uncommitted input buses 
and gains the flexibility of adding storage elements to these 
buses. The overall organization is more structured. Also, a 
larger power budget is available for the register file thus 
making it faster and bigger than if it had been in the processor 
chip. Functional partitioning results in an open system, giving 
the designer the ability to easily connect external components, 
e.g. memory components or arithmetic accelerators. Also, 
each of the Am29300 components, while designed to work 
together in a system, can be used as a standalone functional 
block. 


THREE-BUS FLOW-THROUGH 
ARCHITECTURE 


The Am29300 family features a three-bus flow-through archi- 
tecture. The Am29332 ALU, Am29325 Floating Point Proces- 
sor and Am29323 Multiplier all have two input buses and one 
output bus. This contributes to high throughput by eliminating 
bus bottlenecks caused by turnaround delays. It provides un- 
limited register file expansion and regular addressability. 


Moreover, the unlimited bus accessibility gives the designer 
the ability to configure the optimal micro-architecture for the 
application. If the design objectives change, the micro- 
architecture can be easily reconfigured. The three-bus configu- 
ration also supports concurrent processing and pipelined 
architectures. 


BALANCED TIMING 


In previous generations of microprogrammed systems, the 
control path containing the sequencer has been the bottleneck 
because the sequencer was usually slower than the associ- 
ated data path. Not so in the Am29300 family. The Am29331 
sequencer has been designed so that the entire system timing 
is balanced between the control path and the data path leading 
to higher overall throughput. 


POWERFUL INSTRUCTION SETS 


Each device in the family executes its instructions in a single 
cycle. 


The Am29332's instruction set is symmetric and orthogonal. 
Symmetric means that an operation that can be executed on
port A can also be executed on port B and vice versa. Orthogonal
means that all operations are independent of the data type.
The Am29332 can operate both on multibyte data and on
variable-width field data. This regularity of the Am29332's
instruction set makes it easy to create “clean” interfaces to
compilers for high level language support.


The Am29331’s instruction set is comprised of instructions that 
resemble high level language constructs. This makes it possi- 
ble to write structured microprograms. 


COMPLETE INTERLOCKING 
FAULT DETECTION 


The family supports both master/slave fault detection and data 
path parity to enhance system reliability by ensuring data in- 
tegrity and correct hardware operation. 


The system features byte parity checking on the inputs and 
byte parity generation on the outputs of the Am29332 ALU and 
the Am29323 Multiplier. Also, the organization of the Am29334 
64x18 register file accommodates parity bits for each byte. 
The parity mechanism assures data path integrity. 


Major functional blocks (the Am29332 ALU, Am29331 Sequencer
and Am29323 Multiplier) also have master/slave
fault detection to ensure correct device operation without 
having to carry parity through complex internal logic and with- 
out having to pay the resulting delay penalties. In master/slave 
mode, two functional units are connected in parallel with one 
unit performing the actual operation and the other checking the 
result, on a cycle-by-cycle, bit-by-bit basis. 


The master is used in the normal data path. In the slave, 
however, all outputs become inputs, and the slave compares 
the outputs of the master with its own internally generated 
result. If the two don’t match, an error signal is generated, 
which can trigger an interrupt at the microinstruction level. No 
specialized software is required. Also, the designer can 
choose to impose redundancy at the component or board 
level. 
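
The master/slave comparison described above can be illustrated with a minimal Python sketch. It assumes a plain 32-bit add as the checked operation; the function names and the XOR-based comparison are illustrative choices and do not describe the devices' internal circuitry.

```python
MASK32 = 0xFFFFFFFF

def unit_operation(a, b):
    """Stand-in for the function both units perform (a 32-bit add, assumed)."""
    return (a + b) & MASK32

def slave_check(a, b, master_outputs):
    """Recompute the result and compare it bit by bit with the master's outputs.

    Returns (error, mismatched_bits); `error` plays the role of the error signal.
    """
    expected = unit_operation(a, b)
    mismatched_bits = (expected ^ master_outputs) & MASK32
    return mismatched_bits != 0, mismatched_bits

# One cycle with a healthy master, and one where output bit 4 is wrong.
good = unit_operation(0x1000, 0x0234)
print(slave_check(0x1000, 0x0234, good))
print(slave_check(0x1000, 0x0234, good ^ 0x10))
```

Because the check runs on every cycle, a single faulty output bit is flagged in the same cycle in which it appears.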


The parity and the master/slave provisions comprise a com- 
plete interlocking fault detection mechanism. Using cost- 
effective hardware rather than expensive software, they 
provide a comprehensive solution for fault tolerant systems. 


PERFORMANCE/FLEXIBILITY/INTEGRATION 


The Am29300 family achieves high performance and high inte- 
gration but avoids architectural or pipelining restrictions. These 
become especially important in high performance parallel ar- 
chitectures or in emulations where the system is being opti- 
mized for particular instructions or processes. 


The ECL-internal, TTL I/O Am29300 family minimizes the
requirement for external components and achieves a system 
cycle time of well under 100 nsec. 





32-bit bipolar 
building blocks 
debut at AMD 


Alex Mendelsohn 
Editor-in-Chief 


INTEGRATED CIRCUITS MAGAZINE 


Reprinted with permission from Hearst Business Communications, Inc., November 1984, Integrated Circuits, all rights reserved. 


The last six months have
seen a procession of ad- 
vanced microprocessors and pe- 
ripheral support chips making 
the leap from NMOS fabrication 
technologies to CMOS. Although 
systems designers can now im- 
plement circuits with fast-run- 
ning machines that dissipate less 
power than their NMOS for- 
bears, true speed-demons still 
opt for bipolar devices. 

Witness the success of the Ad- 
vanced Micro Devices Type Am- 
29116, a microprogrammable 16- 
bit bipolar microprocessor whose 
100 nanosecond microcycle bit 
slice speed has been an attractive 
calling card in recent years for 
designers seeking maximum sys- 
tem throughput. Most designers 
using the 29116 haven’t felt 
compromised by power dissipa- 
tion and supply requirements— 
the tradeoff has been a fair one. 

AMD is now at it again, but 
in addition to bipolar speed, 
AMD’s latest chip set features 
an open 32-bit wide “building
block” register file/ALU architecture
that lends itself to unique
general purpose implementa- 
tions, freeing you from cast-in- 
silicon approaches. 

Targeting designers looking 
for blazing speed for projects like
parity-equipped fault tolerant
processors, advanced graphics 
systems, image processors, large 
register file based RISC ma- 
chines, and high-throughput 
simulators, AMD has just an- 
nounced a “superset” of the flex- 
ibility of the venerable 29116. 
This first ECL-internal/TTL-I/O 
32-bit architecture is the 29300 
Family. 

Partitioned for performance 
into five individual bipolar devices,
the 29300 offers a one-chip
ALU with access to three 32-bit 
buses. You, as a designer, can 
thus arrange your own unique 
system as you see fit. The 29300 
busing provides a “flow through”
architecture and virtually unlim- 
ited bus accessibility and register 
file expansion. No bidirectional 
busing is used, and apparently 
AMD chip designers were not 
concerned with conserving pack- 
age pins. 

An all-important orthogonal 
instruction set facilitates structured
microprogramming, permitting
the machine to execute a
number of functions on each mi- 
crocycle in a regular symmetric 
way. Pins are available to tell 
whether an operation is byte 
width or a 16-, 24-, or 32-bit op- 
eration. No coding changes are 
required to perform at the byte 
or at the 32-bit level. The com- 
piler is therefore very easy to 
generate, without exception han- 
dling complexities; a high level 
language interface is thus a 
“clean” one. 

All instructions execute in sin- 
gle machine cycles. During one 
such cycle a 29300 system can 
do as much as it would take six 
or seven cycles to perform in one 
of today’s MOS machines. For 
example, a shift and rotate could
be combined with logical-ORs,
something a 68020, even with its 
on-chip cache, would need mul- 
tiple cycles to perform. 

Functionally, the 29300-fam- 
ily is horizontally partitioned to 
provide faster processing than 
previous vertically partitioned 
bit-slice approaches. Five bipolar 
VLSI circuits are to be introduced 
between now and next summer. 
AMD has already seen first sili- 
con on one of the elements, a 32- 
bit math processor. 

The five ICs are: a 32-bit arith- 
metic logic unit, dubbed the Am- 
29332; a four-port dual-access 
64-by-18 register file—the Am- 
29334; the aforementioned high 
speed floating point processor, 
the Am29325; a Type Am29323 
32-bit parallel multiplier with 
two Read and two Write ports; 
and lastly, a Type Am29331 16- 
bit microprogram sequencer. 
Let’s take a look at each.


The 29332 arithmetic logic 
unit (ALU) is a 32-bit wide non- 
slice three bus IC that allows in- 
tegration of functions that nor- 
mally don’t slice. Examples of 
these include shifters, priority 
encoders, and mask generators. 
Instructions are tailored to take 
advantage of these internal 
blocks (offering field logical op- 
erations or concatenation across 
word boundaries). 


The Heart of the System 


Cycle times for all 29332 ALU
instructions are equal. Pipelined
registers are avoided so that in- 
dividual designers can build-in 
pipelining or not, according to 
their own schemes, paying no 
penalties for branching. The 
three bus architecture also allows 
ready design of parallel and re- 
configurable architectures. The 
off-chip register file ensures un- 
limited expansion and regular 
addressability. 




The 29332 also includes a 
unique 64-bit in/32-bit out fun- 
nel shifter block. It allows n-bit 
shift-up/down as well as 32-bit
barrel shifts (see Integrated
Circuits Magazine, Jan./Feb. ‘84, 
page 34). The funnel shifter also 
permits 32-bit field extraction in 
conjunction with the mask gen- 
erator. These unique functions 
can be combined with all logical 
instructions within the same cy- 
cle and with no increases in cy- 
cle time. 
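
A funnel shifter of this kind can be modeled in a few lines. The Python sketch below is only an arithmetic illustration: the function names, the argument order and the way the shift position is encoded are assumptions, not the 29332's actual instruction encoding. It shows how one 64-bit-in/32-bit-out selection covers shifts, rotates and field extraction.

```python
MASK32 = 0xFFFFFFFF

def funnel_shift(high, low, position):
    """Pick 32 bits out of the 64-bit value high:low, starting `position` bits above bit 0."""
    if not 0 <= position <= 32:
        raise ValueError("position must be in 0..32")
    combined = ((high & MASK32) << 32) | (low & MASK32)
    return (combined >> position) & MASK32

def rotate_right(word, n):
    """Barrel rotate: feed the same word to both halves of the funnel."""
    return funnel_shift(word, word, n % 32)

def shift_right_logical(word, n):
    """Shift down: zeros enter from the high half."""
    return funnel_shift(0, word, n)

def shift_left(word, n):
    """Shift up: the word sits in the high half, zeros in the low half."""
    return funnel_shift(word, 0, 32 - n)

def extract_field(high, low, position, width):
    """Field extraction: funnel shift, then AND with a generated mask."""
    mask = (1 << width) - 1
    return funnel_shift(high, low, position) & mask

print(hex(rotate_right(0x80000001, 1)))
print(hex(extract_field(0x00000001, 0x80000000, 31, 2)))
```

The last call extracts a two-bit field that straddles the boundary between the two input words, which is the case a sliced shifter handles poorly.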

Shift control for the above 
functions can come from an ex- 
ternal source or from the internal 
status register (generated on a 
previous instruction)—a useful 
feature for logical operations be- 
tween non-aligned variable- 
length fields. It can also be used 
for floating point normalization. 

Use of internal position and 
width status register fields can 
save eleven bits of microcode 
width. As mentioned previously,
the 29332 also supports one-,
two-, three-, or four-byte data as
well as multiprecision arithmetic
and multiple bit shift operations.

The vanguard of the 29300 family is the 29325 floating
point processor. Here’s a photomicrograph of first silicon.


A LOOK AT COMPETITIVE ALUs

In evaluating the new AMD 29300-family architecture, it
may pay to look at some of the competitive devices recently
introduced by other IC vendors. For example, Texas
Instruments has been expanding their low-voltage high speed
small-transistor advanced Schottky (AS) bipolar line to
include a 20 MHz 8-bit slice.

Their 74AS888 features a parallel 8-bit ALU with expansion
inputs and outputs, a 16 x 8 register file, and handles
bit, byte, and word length operations. When used with their
new 25 nanosecond 74AS890 microinstruction sequencer,
architectures can be built without limit (i.e. 64-bits wide).

The 74AS890 controller has an address width of 14-bits,
and can thus address up to 16,384 words of microcode.
These ICs are designed to implement systems with narrow
microcode word widths and very high throughput, and
as such, should compete favorably.

Also in the realm of ALUs, but in MOS technologies, Analog
Devices (Norwood, Massachusetts) has recently announced
their 16-bit Type ADSP-1201. This device, as the name
suggests, is targeted at digital signal processor (DSP)
designers. Similarly, Weitek, of Santa Clara, California,
introduced an NMOS two chip set early this year; their
32-bit data path Type WTL1032 multiplier and WTL1033 ALU.

Both Weitek’s and Analog Devices’ ICs, while excellent
for DSP applications, are somewhat limited for non-DSP
circuits because of their extensive pipelining. Both include
registered inputs and outputs, thus they both require more
than one cycle to get results “off” the chip. Operands are
entered, and the results taken, from the Weitek chip in about
125 nanoseconds, for example.

The ADSP-1201 also includes a single port 8-word register
file in each data path, however this precludes expandability.
The chip does include a barrel shifter, but use of the
shifter and the ALU combinatorially in the same cycle
isn’t possible.

In contrast, 29332 ALU designers avoided the use of
pipelining. There are no built-in penalties for branching. The
simplicity of the three-bus ALU should also allow very
easy implementation of parallel or reconfigurable architectures.

Another consideration in the comparison: the 29332's
off-chip register file allows unlimited and regular
addressability. In contrast, the ADSP-1201 has a single port
eight-word register file in each input data path. These are
not expandable.

The 29332 supports one-, two-, three-, and four-byte
data for arithmetic and logic functions as well as
multiprecision arithmetic and multiple-bit shift operations.
Neither Analog Devices’ nor Weitek’s ALUs support all data
types for arithmetic operations. Neither can they support
field logical operations as used in graphics.

For logical operations, the 
29332 can accommodate variable 
length fields up to 32-bits. When 
fewer than four bytes are selec- 
ted, the unselected bits are 
passed to the destination without 
modification. Support of all data 
types is highly important for 
arithmetic operations; field logi- 
cal operations are very necessary 
for applications such as graphics. 
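
The pass-through behavior for unselected bits can be sketched as a masked update. The Python below is a hedged illustration only: `field_op`, its parameters and the position/width convention are hypothetical names chosen for the example, not the 29332's instruction format.

```python
import operator

MASK32 = 0xFFFFFFFF

def field_op(dest, src, position, width, op):
    """Apply `op` only to a field of `width` bits starting at bit `position`.

    Bits of `dest` outside the field are returned unmodified.
    """
    if width < 1 or position < 0 or position + width > 32:
        raise ValueError("field must lie inside a 32-bit word")
    mask = (((1 << width) - 1) << position) & MASK32
    changed = op(dest, src) & mask          # result bits inside the field
    untouched = dest & ~mask & MASK32       # destination bits outside the field
    return changed | untouched

# OR an 8-bit field at bits 4..11; XOR (invert) the byte at bits 8..15.
print(hex(field_op(0xFFFF0000, 0x00000FF0, 4, 8, operator.or_)))
print(hex(field_op(0x12345678, 0xFFFFFFFF, 8, 8, operator.xor)))
```

In a graphics setting the field would typically be a run of pixels inside a word, so neighboring pixels in the same word are left alone.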

Support is also provided for 
two-bit at a time modified 
Booth’s algorithm multiplication 
and one-bit at a time divide with 
both signed and unsigned in- 
tegers. Parity checking on data 
inputs and generation on out- 
puts, plus master/slave fault de- 
tection enhances applicability in 
fault tolerant systems. 
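
For readers unfamiliar with two-bit-at-a-time modified Booth multiplication, the following Python sketch shows the arithmetic. It is a software model under stated assumptions: the function name, the 32-bit default and the step counter are illustrative, and nothing here reflects how the 29332 sequences these steps in microcode.

```python
def booth_radix4_multiply(multiplicand, multiplier, bits=32):
    """Signed multiply that retires two multiplier bits per step.

    Each step recodes bits (2i+1, 2i, 2i-1) of the multiplier into a digit
    in {-2, -1, 0, +1, +2} and adds digit * multiplicand * 4**i.
    Returns (product, steps).
    """
    mask = (1 << bits) - 1

    def to_signed(x):
        x &= mask
        return x - (1 << bits) if x >> (bits - 1) else x

    m = to_signed(multiplicand)
    r = multiplier & mask
    product = 0
    previous_bit = 0                      # implicit bit to the right of bit 0
    steps = 0
    for i in range(0, bits, 2):
        b0 = (r >> i) & 1
        b1 = (r >> (i + 1)) & 1
        digit = b0 + previous_bit - 2 * b1
        product += (digit * m) << i
        previous_bit = b1
        steps += 1
    return product, steps

print(booth_radix4_multiply(7, -3))
print(booth_radix4_multiply(-5, -6))
```

With 32-bit operands the loop runs 16 times, half the step count of a one-bit-at-a-time scheme.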


Separate Register File 

The Type 29334, the 20 nano- 
second access time RAM register 
file chip, supports these parity 
equipped designs. A byte parity
storage feature, with a width of
18-bits, provides consistency 
with the ALU for parity check 
and generation. No other multi- 
port register file presently on the 
market offers this parity storage. 

The 29334, with its two Read 
and two Write ports, can read 
and write simultaneously on 
both. It is cascadable to support 
wider word widths or to form 
deeper register files, or both. You 
can use multiple 29334’s in an in- 
terleaved configuration, or you 
can build one file as high and as 
wide as you like. Write enable 
timing and multiplexer selection 
are derived from a single-phase 
clock; the MUX eliminates one 
“layer” of I/O delay. Also, indi- 
vidual byte write enables allow 
choice of either an 8- or 16-bit 
data interface. 


Math Chip Here and Now 

The single-chip 144-pin Type 
29325 floating point processor— 
the first VLSI family member to 
emerge from fab—performs very 
fast 32-bit single precision addi- 
tion, subtraction, and multiplica- 
tion. It conforms to the proposed 
IEEE P754 standard and also to 
Digital Equipment Corporation’s 
(DEC) format. 

Options for conversion be- 
tween the 32-bit integer format 
and floating point are available, 
as are operations for converting 
between IEEE and DEC. Execut- 
ing all instructions in a single 
cycle, the 29325’s throughput 
equates to 8 MHz, regardless of 
the algorithm. 

The 29325 features three 32- 
bit wide non-multiplexed buses 
for high I/O bandwidth. The use
of two 32-bit operand feedforward
data paths on-chip supports
accumulation operations, includ- 
ing sum-of-products and New- 
ton-Raphson division. All buses 
are registered and each has a 
clock enable. Registers can be
independently transparent to
eliminate unwanted pipelining if 
desired. Software synchroniza- 
tion of pipelining is not needed 
because there are no multi-stage 
pipeline structures. 
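
Newton-Raphson division is worth a short illustration, since it builds a divide out of the multiply and subtract operations the chip does provide. The Python sketch below is an assumption-laden model: the seed value is supplied by the caller (a real design would typically take it from a small lookup table), the iteration count is arbitrary, and single precision is imitated by rounding through the standard `struct` module.

```python
import struct

def f32(x):
    """Round a Python float to IEEE single precision."""
    return struct.unpack("f", struct.pack("f", x))[0]

def reciprocal(d, seed, iterations=4):
    """Refine an estimate of 1/d with x <- x * (2 - d * x)."""
    x = f32(seed)
    for _ in range(iterations):
        correction = f32(2.0 - f32(d * x))   # one multiply, one subtract
        x = f32(x * correction)              # one multiply
    return x

def divide(n, d, seed, iterations=4):
    """n / d computed as n * (1/d), using only multiply and subtract."""
    return f32(n * reciprocal(d, seed, iterations))

print(divide(1.0, 3.0, seed=0.3))
print(divide(10.0, 4.0, seed=0.2))
```

Each iteration roughly squares the relative error of the estimate, so a modest seed reaches single precision in a handful of passes.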


Go Forth and Multiply 


Although the Type 29323 
multiplier chip is still in the de- 
sign stage, and no working sili- 
con has yet been seen, AMD chip 
designers are confident that the 
device will be able to perform a 
32 by 32 multiply in only two cy- 
cles. They’re hoping to achieve 
an 80 nanosecond clock-to-clock 
multiply time. 

With two 32-bit input and one 
32-bit output ports, there is no 
need to multiplex operands. 
The chip will be controlled by 
only one clock with individual 
register enables, thus leading to 
simple timing requirements. 
Dual input port registers will en- 
able multiprecision multiplica- 
tion (a 32-bit multiply will occur 
in one cycle; a 64-bit in four). 
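
The four-pass figure for a 64-bit multiply follows from splitting each operand into two 32-bit halves. The Python sketch below shows that decomposition for unsigned operands; it assumes one 32 x 32 partial product per pass and does the accumulation in software, which is an illustration rather than a description of the 29323's internal accumulator.

```python
MASK32 = 0xFFFFFFFF

def multiply_64x64(a, b):
    """Unsigned 64 x 64 multiply built from four 32 x 32 partial products.

    Returns (product, passes), where passes counts the 32 x 32 multiplies.
    """
    a_lo, a_hi = a & MASK32, (a >> 32) & MASK32
    b_lo, b_hi = b & MASK32, (b >> 32) & MASK32
    partials = [
        (a_lo * b_lo, 0),    # low  x low
        (a_lo * b_hi, 32),   # low  x high
        (a_hi * b_lo, 32),   # high x low
        (a_hi * b_hi, 64),   # high x high
    ]
    product = sum(value << shift for value, shift in partials)
    return product, len(partials)

print(multiply_64x64(0xFFFFFFFFFFFFFFFF, 0xFFFFFFFFFFFFFFFF))
```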

Like the floating point chip, 
the 29323 multiplier’s registers 
can be made independently 
transparent to eliminate unwanted
pipeline delays in non-pipelined
systems. A master/slave
mode allows two 29323’s to op- 
erate in parallel. A parity check/ 
generate feature will catch inter- 
device errors. 


Advanced Program Control 


The remaining chip in the 
29300-family will be the dual 
bus 29331 microprogram se- 
quencer. Controlling the se- 
quence of microinstructions 
stored in microprogram memory, 
the 29331 aids structured micro- 
programming, handling se- 
quential execution, branches, 
subroutines, and loops. It can ac- 
cess up to 64 Kwords of micro- 
code, and integrates otherwise 
external critical-path conditional
test logic into internal high speed
gates to improve microcycle time. 
There is no branching penalty. 

The 29331 generates inequal- 
ity evaluation branch conditions 
from four ALU status bits. It fea- 
tures an eight external-test-con- 
dition multiplexer plus parity 
control. An address comparator 
allows breakpoint in the micro- 
code for debug or for gathering 
run-time statistics. This latter 
feature is something AMD has 
identified that most systems 
builders want but, so far, no ven- 
dor has implemented in silicon. 

Other 29331 features include 
four sets of 4-bit multiway in- 
puts to implement table look-up 
or to use external conditions as 
part of a branch address, master/ 
slave error checking for parallel 
sequencer operation, real-time 
interrupt support, trap handling 
at any microinstruction boundary,
and a 32-level stack.
The latter provides the ability
to support interrupts and loops 
as well as subroutine nesting. 
The stack can be read to support 
diagnostics or to run multi-task- 
ing at the micro-architecture lev- 
el. The chip’s instruction set is 
designed to resemble high level 
language constructs. 


Development Support 


AMD expects most customers 
will use Tektronix, Hewlett- 
Packard, or AMD development 
systems for microcode development,
and AMD is introducing
“M29” software that will run on
a VAX-size mini.

M29 uses a description lan- 
guage that can describe a variety 
of architectures. It consists of 
three programs: a microinstruc- 
tion definition program, an as- 
sembler, and a relocation linker. 

The definition program creates 
a file that describes each
microinstruction field by field, defining
its name and length, its fields
and variations in format, and al- 
lowable values for each field. 
The file allows the assembler to 
be retargeted to support many 
different instruction formats. 

The assembler allows you to 
create microcode in several styles 
depending on the amount of ef- 
fort invested in the design of 
macros. The linker for the as- 
sembler is used to relocate as- 
sembly modules and link them 
together. Its output can be loaded 
into a writable control store or 
burned into PROMs. 

For more details call AMD at 
408-732-2400, or use the Reader 
Service card.


NOVEMBER 1984 









32-Bit ICs Enhance Array 
Processor Performance 


by Dave Wilson, Executive Editor

Future high performance processors/controllers
require faster
processing rates, higher machine 
densities, and greater system reliability. 
The need for virtual memory support, in- 
creased memory bandwidth and im- 
proved precision means a growing de- 
mand for 32-bit performance. Advanced 
Micro Devices’ (Sunnyvale, CA) Am- 
29300 family has been developed to ad- 
dress these needs in general purpose 
computation, intelligent peripheral con- 
trol, and array and digital signal process- 
ing applications. 

The Am29300 family evolved from the 
industry standard Am2900 bit-sliced 
family. A great number of the functional 
enhancements are the result of user feedback
from existing Am2900-based designs.
On the other hand, the Am29300
family has been designed from the ground
up for higher performance and architec- 
tural flexibility. The Am29300 devices 
have internal ECL circuitry for speed, yet 
maintain TTL compatible inputs/outputs 
for ease of interface. A 32-bit Am29300 
microprogrammed system has a system 
microcycle time of 70 to 80 nsec. In addi- 
tion, the devices have regular and or- 
thogonal instruction sets and contain 
built-in primitives to tackle crucial 
system issues such as fault tolerance/ 
detection. 

AMD offers several support chips for
Am29300-based designs. For instance,
systems requiring a 32-bit data path can
be configured with the following devices:
the Am29332 Integer Processor, a 32-bit
arithmetic/logic and shift unit with built-in
support for variable byte and bit field
data; the Am29334 Register File, a true
dual-ported register file which allows
simultaneous read and write accesses,
organized as 64 words by 18 bits; the
Am29323 Parallel Multiplier, a 32 x 32
parallel multiplier capable of multiple cycle
expansion to 64 x 64 and 128 x 128
without the use of external logic; and the
Am29325 Floating Point Processor,
which performs single cycle addition,
subtraction, multiplication and conversions,
using either the single precision
IEEE or DEC format. Each of the above
devices can be used in conjunction with
or independent of the others. These devices
can be configured in a variety of
ways to tailor them to a specific application.
All of the data path elements have
single cycle instructions. The
microinstructions are typically supplied
by the control path. The key control path
element is the Am29331 Microprogram
Sequencer which supplies the next address
to the control memory. The 16-bit
Am29331 is also capable of handling
interrupts or traps at the microinstruction
level.

Photomicrograph of AMD's 29325 floating
point processor.

Historically, the Am2900 devices have 
been partitioned vertically, combining 
register file and ALU in a single package.
The Am29300 devices, however, are par- 
titioned horizontally, so that the register 
file is separated from the rest of the data 
path elements. The functional partition- 
ing has two advantages. First, it allows for 
an easily expandable register file space. 
Second, it also enables arithmetic ac- 
celerators to add to the data path. 

A major disadvantage of a bit-sliced ar- 
chitecture is the time lost in transmitting 
carries from one chip to another. To avoid
this, the Am29300 family arithmetic ele- 
ments are constructed with full internal 
32-bit data paths. Although the ALU has 
limited carry capability for cascading, it 
will normally perform multi-precision 
expansion through multiple cycle operations.
The other data path elements use
this scheme exclusively.

Reprinted with permission from Digital Design, Vol. 14, No. 11; copyright Morgan-Grampian Publishing Company, November 1984.

A second benefit of the full 32-bit data 
path elements is that they can include 
functions not easily sliced. Two classic 
examples of this are shift arrays and 
multipliers. Both of these require an 
unacceptable amount of information to be 
transferred between slices. Other func- 
tions, such as prioritization and mask 
generation for byte and word operation, 
while feasible, expand clumsily. All these 
functions are provided in the Am29300 
family, either in the ALU, or in the 32-bit 
multiplier. 


Three-Bus Flow 
Through Architecture 


In order to fully exploit the 32-bit data 
path devices, it is necessary to provide 
adequate data transfer bandwidth. In the 
Am29300 family this is achieved through 
the use of a three-bus architecture. The 
Am29334 Register File is a true two- 
ported file, allowing simultaneous access 
from each port. Output latches are pro- 
vided to allow read and write operation 
within a single clock cycle. Each of the 
data path elements has two 32-bit oper- 
and input buses which can be sourced 
from the register file. The data path 
elements also have a 32-bit result bus 
which can return data to one input of the 
register file. With this organization, a 
three-address register to register opera- 
tion may be completed within a single 
clock cycle. 

Two register files may be used to
achieve still higher bandwidth. Connecting
the input ports in parallel, and writing
duplicate data into the two files, allows
four operands to be sourced simultaneously
from a single database. Two results
may also be written into the file
simultaneously. This provides adequate data
transfer for two groups of arithmetic
elements to operate concurrently (Figure
1). The flexibility of this three-bus
architecture also allows the use of these parts
in other configurations. In a signal or array
processing application, the multiplier
and ALU may be placed in series rather
than parallel. This provides a “free
operand,” allowing the three-operand
summation of products operation to
proceed at maximum speed.

Figure 1: RAM with 4
read and 2 write ports.
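
The duplicated-file arrangement can be modeled as two identical storage arrays that always receive the same writes. The Python sketch below is a behavioral illustration under assumptions: the class and method names are hypothetical, each file is treated as 64 full-width words, and timing, parity bits and width cascading are ignored.

```python
class MirroredRegisterFile:
    """Two register files written in parallel, giving four read ports."""

    WORDS = 64

    def __init__(self):
        self.files = [[0] * self.WORDS for _ in range(2)]

    def write(self, writes):
        """Apply up to two (address, value) writes; each goes to both files."""
        if len(writes) > 2:
            raise ValueError("at most two writes per cycle")
        for address, value in writes:
            for storage in self.files:
                storage[address] = value

    def read(self, a0, a1, a2, a3):
        """Source four operands: two from each physical file."""
        first, second = self.files
        return first[a0], first[a1], second[a2], second[a3]

rf = MirroredRegisterFile()
rf.write([(1, 10), (2, 20)])
rf.write([(3, 30), (4, 40)])
print(rf.read(1, 2, 3, 4))
```

Because both files hold the same contents, any of the four read ports can address any register, which is what lets two groups of arithmetic elements share one register space.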

The cycle time of a microprogrammed 
system is dependent on both the control 
path (i.e., sequencer and microprogram 
memory) and the data path (i.e., register 
file and ALU). Traditionally, the system
bottleneck has been the control path, 
especially the timing paths associated 
with conditional branching. The 16-bit 
Am29331 Microprogram Sequencer has 
been optimized for speed, so that the data 
path and control path timing are balanc- 
ed. The previously external condition 
code multiplexer, test logic generator and 
polarity control logic (usually the system 
critical path), have been integrated on 
chip. Moreover, the Am29331 has several 
built-in features which enable it to res- 
pond to external stimuli with minimum 
latency. The sequencer can perform a 
16-way branch, dependent on the simultaneous
occurrences of four external test
conditions. The Am29331 Microprogram 
Sequencer can also handle interrupts or 
traps at the micro-level.
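
One way to picture a 16-way branch is to let four test conditions supply four bits of the next microprogram address. The Python sketch below assumes the conditions replace the four least significant bits of a 16-bit base address; that bit placement, like the function name, is an assumption made for illustration and is not taken from the Am29331 specification.

```python
def multiway_branch_address(base_address, tests):
    """Form a next address whose low four bits are four test conditions.

    `tests` is a sequence of four truth values, most significant first.
    """
    if len(tests) != 4:
        raise ValueError("exactly four test conditions are required")
    index = 0
    for condition in tests:
        index = (index << 1) | (1 if condition else 0)
    return (base_address & 0xFFF0) | index

# Conditions 1, 0, 1, 1 select entry 0xB of a 16-entry branch table.
print(hex(multiway_branch_address(0x1230, (1, 0, 1, 1))))
```

The four conditions are resolved in a single step instead of through a chain of two-way conditional branches.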

The system ARM concept (Availabili- 
ty, Reliability and Maintainability) is 
becoming increasingly important. The 
Am29300 addresses the problem of fault 
detection at the device level by a combina- 
tion of two techniques — parity and 
master/slave. Parity at the byte level is 
generated on the 32-bit result bus of the 
data path elements, stored in the Am29334 
Register File, and checked again going into
any of the operand buses of the data path
elements. Thus any interconnection failure 
in the data bus can be detected. The choice 
of an even parity scheme also allows detection
of an open TTL bus, which defaults to a
high impedance all “ones” state, an error
condition. For functional verification, a 
master/slave mode of operation permits 
two units to be connected in parallel, with 
one unit actually performing the computa- 
tion and the other checking the results on 
a cycle by cycle basis. The slave unit 
therefore verifies correct operation of the 
master. In addition, the master unit checks 
its internal result with the data on the output 
bus to ensure that no other device is driving 
the external bus when it is not supposed 
to be. Any fault detected can trigger an
interrupt at the microinstruction level. Unlike
previous redundant schemes, no specializ- 
ed software is required. No system degra- 
dation results from the communication be- 
tween the redundant functional units. This 
combination of parity checking and master/ 
slave operation, which uses cost-effective 
hardware, rather than expensive software, 
is the key to future redundant system de- 
sign. 
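
The remark about even parity and an open bus can be checked with a few lines of arithmetic. The Python sketch below assumes one parity bit per data byte and models an open bus as every line, data and parity alike, reading as one; the function names are illustrative.

```python
def even_parity_bit(byte):
    """Parity bit that makes the total number of ones (data + parity) even."""
    return bin(byte & 0xFF).count("1") & 1

def byte_is_valid(byte, parity_bit):
    """Check a received byte and its parity bit under even parity."""
    ones = bin(byte & 0xFF).count("1") + (parity_bit & 1)
    return ones % 2 == 0

def word_parity(word):
    """Generate the four byte-parity bits for a 32-bit word, low byte first."""
    return [even_parity_bit((word >> shift) & 0xFF) for shift in (0, 8, 16, 24)]

print(word_parity(0x12345678))
# An open bus reads 0xFF with the parity line also high: nine ones, an odd count.
print(byte_is_valid(0xFF, 1))
```

Under odd parity the same nine ones would pass the check, so the floating bus would go unnoticed; with even parity it is flagged as an error.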

The functional and performance require- 
ments of a general purpose supermini- 
computer and a digital signal processor are 
vastly different. Yet with functional par- 
titioning and a simple three-bus archi- 
tecture, the Am29300 devices are suited 
to address the needs of a diverse spectrum
of applications. Figure 2 depicts an exam- 
ple of a microprogrammed supermini built 
out of Am29300 components. The data 
path consists of the Am29332 Integer Pro- 
cessor, the Am29323 Parallel Multiplier 
as an accelerator, and the Am29334 
Register File. In this configuration, address 
calculation and data computation are per- 
formed in series. Alternatively, the 
Am29334 can be paralleled to yield effec- 
tively a six-ported register file, allowing 
four read accesses and two write accesses 
per microcycle. Another Am29332 can be 
dedicated to perform address computation 
concurrent with the normal ALU execu- 
tion, sharing the register space. With a 70 
to 80 nsec microcycle time, a pro- 
cessor/controller subsystem capable of 
several times the performance of a typical
supermini can be built with the Am29300 
parts, occupying far less board space and 
dissipating significantly less power. 

Figure 3 is a block diagram of a small 
array processor using the Am29325. A 
high-speed multi-port memory is used to 
provide storage for operands such that they 
may be accessed in simultaneous pairs. 
These operands may originate in the data 
memory, or may be intermediate results 


DECEMBER 1984, DIGITAL DESIGN

[Block diagram labels: Control, Address, Data; Instruction Queue; Instruction Decode; Control Store]


from the processor. One of these operands 
may be replaced with a value drawn from 
a non-volatile coefficient store. 

The array processor is microprogram 
controlled, with memory addresses being 
derived directly from the microcode. This 
is probably inefficient for large programs, 
and some form of microprogrammed address
generator would need to be added.
The interface to the host processor is 
deliberately undefined, as this is user 
dependent. 

As a benchmark, this processor can perform
FFT butterflies in the canonical time
of 10 cycles. At a 100 nsec cycle time, this
permits one butterfly every 1 μsec, or a
1024-pt complex transform in 5.12 msec.
A simple modification to the architecture
allows a second Am29325 to be incorporated
to give a complex arithmetic processor.
This doubles the throughput for
the FFT, reducing the computation for 
the 1024-pt transform to 2.56 msec. 
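
The benchmark figures above can be checked with a short calculation. This is only a sketch: it assumes a radix-2 transform, which needs (N/2) x log2(N) butterflies, and the function name and parameters are illustrative rather than anything defined by the article.

```python
def fft_time_ms(points, cycles_per_butterfly=10, cycle_time_ns=100, processors=1):
    """Estimate radix-2 FFT time in milliseconds (assumed model, see lead-in)."""
    if points < 2 or points & (points - 1):
        raise ValueError("points must be a power of two")
    stages = points.bit_length() - 1           # log2(points)
    butterflies = (points // 2) * stages       # radix-2 butterfly count
    total_ns = butterflies * cycles_per_butterfly * cycle_time_ns
    return total_ns / processors / 1e6         # ns -> ms


print(fft_time_ms(1024))                 # single Am29325
print(fft_time_ms(1024, processors=2))   # second Am29325 added
```

Under these assumptions a 1024-point transform needs 5120 butterflies, which at 1 μsec each matches the 5.12 msec quoted in the text.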

While the Am29325 only provides sin- 
gle-precision floating point operation, 
the Am29300 family also provides bus- 
compatible devices which may be used to 
enhance the capabilities of the array pro- 
cessor described. The Am29332 Integer 
Processor offers a wide range of arithme- 
tic, logic and shift facilities. This device 
may be operated with a reduced width 
data path, allowing words of 1 to 4 bytes. 
The internal architecture is designed for 
efficient programming of floating point 
operations, and may therefore be used to 
support the Am29325 with double-preci- 
sion operations. To assist in double-pre- 
cision floating point multiplication, or 
for integer multiplication, the Am29323 
32-bit Multiplier provides 32 x 32-bit 
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multiplication in a single cycle, and has 
internal facilities for multi-cycle expan- 
sion to 128 X 128. 

These additional arithmetic elements 
have the same 32-bit, three-bus architec- 
ture as the Am29325. This allows them to 
be added in parallel. The routing of 
operands to the appropriate arithmetic 
element is a simple microcode task. 

The horizontal partitioning of this new 
family of parts has resulted in a number 


Figure 2: Am29300-based supermini emulation.





of benefits. First, the user gains the flex- 
ibility of adding storage elements to two 
uncommitted output buses from the pro- 
cessor. Second, more power budget is 
available for the register file making it 
faster and bigger than if it had been in the 
processor chip. The family addresses a 
number of crucial system issues such as 
fault detection, support of high-level 
languages in systems programming and 
large register file-based architectures like 
RISC.


References: 


32-Bit Building Blocks for High Per- 
formance Processor/Controller, Paul 
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A Very High Speed Floating Point Pro- 
cessor, B.J. New. Advanced Micro 
Devices, Sunnyvale, CA. 
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Figure 3: Am29300-based array processor. 
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Microprogrammable chips 
blend top performance 
with 32-bit structures 


Broken down into 32-bit functional blocks instead 
of being sliced into multiple-bit sections, 
five VLSI bipolar chips match a supermini’s speed. 


Designers of systems and subsystems for
high-speed computation, intelligent
peripheral control, and array and digital
signal processing typically need higher per- 
formance than standard microcomputer parts 
can deliver. The required precision, speed, and 
virtual memory support has to some degree 
been supplied by dedicated VLSI components 
that are customized for particular applications. 
Yet an overwhelming need still remains for a 
set of building blocks that can bring extremely 
high performance to a large assortment of ap- 
plications. 

A new approach extends the bit-slice concept 
to 32 bits and also satisfies system designs that 
require cycle times of less than 100 ns. With a
family of five VLSI chips, designers of micro- 
programmed systems can count on cycle times 


of 70 to 80 ns, using merely a handful of com- 
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ponents. The building blocks for 32-bit systems 
functionally partition the chips and separate 
the register file from the rest of the data path. 

The following two articles first explore the 
key members of the Am29300 family and then 
focus on a floating-point processor, which is the 
first chip scheduled for sampling. Details are 


given on how to use the chip and other devices in


the series to build a fast Fourier transform 
computer, as well as more general-purpose dig- 
ital signal-processing circuits. 

The Am29300 family addresses the problem 
of fault detection through an interlocking 
checking scheme—parity and master-slave. 
Byte parity is generated, stored, and then 
checked on all data-path elements as a means of 
detecting interconnection failures. Moreover, 
to verify certain functions, the master-slave 
operating mode permits two units to be 
connected in parallel, with one unit actually 
handling the computation and the other check- 
ing the result cycle by cycle. 

Detecting a fault triggers an interrupt at the 
microinstruction level. Unlike previous redun- 
dant schemes, no specialized software is re- 
quired. Furthermore, communication among 
the redundant functional units causes no 
system degradation. 

The five chips form a strong foundation for 
any system designer’s work. For instance, a 
16-bit sequencer can handle interrupts and 
traps at the microinstruction level. There is 
also a combined ALU and shifter that internally
supports variable byte and bit fields. Together
with the ALU-shifter chip, a true dual-port
register file, organized as 64 words by 18
bits, can build a basic system. The register file, 
designed for simultaneous read and write ac- 
cesses, is separated from the data-path ele- 
ments, thereby avoiding the problem of ad- 
dressing an internal register file differently 
from external memory. The benefits of that 
separation are uniform register addressing and 
unlimited depth expansion. 

Two accelerator chips—a floating-point 
processor and a parallel multiplier—can be 
added to the basic system to raise the number of 
functions and cut processing time. The 32-by- 
32-bit parallel multiplier can, on successive cy- 
cles, expand to 64 by 64 or 128 by 128 bits, with- 


out help from external logic. For its part, the 
math chip can tackle single-cycle addition, mul- 
tiplication, subtraction, and conversions—all 
in single-precision IEEE or DEC formats. 

Because of functional partitioning, a three- 
bus flow-through architecture was chosen as 
the data path. For maximum bus accessibility, 
all data-path elements—the integer processor 
and the parallel multiplier, for example—share 
two operand and one result bus. The flow- 
through architecture not only transfers data 
extremely quickly but also avoids the complex 
timing control needed to turn around bidirectional
buses. Above all, the simplicity of the
three-bus architecture allows these com- 
ponents to be configured in a variety of ways to 
optimize micro-architectures for different 
jobs. 


Bipolar building blocks 
deliver supermini speed 
to microcoded systems 


As CMOS processes start to encroach on the
performance of bipolar circuits, bipolar
technology is taking the next step to
keep itself in the lead for the highest speed
systems. A family of five bipolar VLSI computational
circuits (fabricated with a scaled,
ion-implanted, oxide-isolated process and three
levels of metal interconnections for high density)
provides a set of functionally partitioned
microprogrammable VLSI building blocks for 
systems such as superminicomputers, digital 
signal processors, high-speed controllers, and 
many others. The modularity of the system 
functions ensures that the chips can meet the 
performance requirements of a general- 
purpose superminicomputer, as well as those of 
an image processor, which are radically differ- 
ent from each other. 

Included in the family are three parts that 
form the core of a general-purpose micro- 
programmed system: a 32-bit arithmetic and 
logic unit (ALU), a 16-bit microprogram 
sequencer, and a 64-by-18 four-port, dual-access
RAM. And, for systems that do a large
number of multiplications or floating-point
operations, two performance accelerators (a
32-by-32-bit multiplier and a 32-bit floating-point
processor) will be available to tie onto the
buses (see Design Entry, p. 246).

The chips offer high performance, a flexible 
architecture, and microprogrammability, and 
even address the problem of fault detection for 
data integrity. These circuits can thus support 
an extremely fast microcycle—about 80 ns 
(projected). That high speed is the result of
several design considerations. First, each part is
designed internally with emitter-coupled logic
but has TTL-compatible inputs and outputs.
Second, more power was allocated to the logic 
circuits used in the critical paths than for logic 
in the noncritical paths on each chip, to max- 
imize the speed. Third, by integrating highly 
specialized logic on chip it is possible to execute 
very complex operations in a single cycle. 

The microprogrammability of this chip set 
offers several benefits to the system designer. 
It provides a structured and systematic ap- 
proach for implementing the control mech- 
anism of the system, and like the bit slices, it al- 
lows the instruction set to be customized to suit 
the designer’s application (see “Architectural 
Limitations of Bit Slices,” opposite). And 
several versions of the initial design can be 
tested, or current designs can be enhanced 
simply by changing the microcode. 

Thus, the functionally partitioned Am29300 
family overcomes all of the performance penal- 
ties of bit-slice structures, while maintaining 
its ability to form a wide variety of architec- 
tures. Even though the chips are designed to 
work together as a family, each can also be used 
independently in an application that requires 
its unique capabilities. 


Pipelines are out 


The flexibility of the Am29300 family is 
largely due to a decision not to place pipeline 
stages within the functional blocks. Not includ- 
ing the pipeline registers inside incurs some 
off-chip delays. This is a small price to pay to al- 
low system designers to optimize the pipeline 
structure for their individual needs. Moving the 
register file out of the functional block for the 
ALU also slows things down. At the same time 
it does not force a fixed register size on the user, 
enabling systems to be created with dedicated 
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registers, register windows, or register banks— 
all with neither fixed depth nor width. 
Additionally, the high level of integration 
helps eliminate the propagation delays often 
encountered when signals must go from chip to 
chip. The use of VLSI also results in fewer parts 
at the system level, which, in turn, conserves 
power (usually many watts in the case of bi- 
polar systems) and board space. Lastly, a com- 
plete 32-bit solution is provided for applications 
that require increased precision for arithmetic 
operations, high memory bandwidth, and a


Architectural limitations 
of bit slices 


The limited performance of bit-slice circuits can 
be improved by increasing the width of the slices. 
That higher level of integration results in higher 
performance by reducing the number of off-chip 
delays while preserving the flexibility that has 
made bit-slice systems so attractive. However, as 
higher levels of integration become possible, two 
inherent problems with bit-slice architectures 
will limit their ultimate speed. The first involves 
the off-chip delays inherent in cascading. For ex- 
ample, the carry chain is usually the slowest path 
of an ALU. Breaking this chain between slices in- 
troduces off-chip delays into the critical path. 

The second problem is that the functional needs 
of many systems do not slice well. Barrel shifters 
and prioritizers are especially difficult to cascade. 
Unfortunately, the ability to perform N-bit shifts 
and locate the position of leading 1s are of greatest 
importance in applications that require heavy 
number crunching and manipulation of data 
fields, such as image processing, graphics, data- 
base management, and controllers. These are pre- 
cisely the applications whose need for speed forces 
the use of bit-slice devices. The system per- 
formance is compromised not only because these 
operations must be done bit by bit, but also be- 
cause many high speed algorithms cannot be effi- 
ciently implemented. 
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large addressing capability (4 billion bytes) to 
support virtual memory systems (Fig. 1). 

The performance of a system depends not
just on its raw computing speed but on its ability
to respond to events such as interrupts and
traps. For example, the Am29331 sequencer responds
to both interrupts and traps at the microprogram
level very quickly, and its response
is completely transparent to the interrupted 
microroutine. Also, the Am29332 ALU indirect- 
ly supports the handling of these events by al- 
lowing its internal state to be saved or restored. 

The Am29332, a noncascadable 32-bit-wide
ALU, provides fast number crunching, high
data transfer rates, and powerful bit-manip- 
ulation capabilities. Intended to be used with 
the Am29334 dual-ported RAM, which serves 
as an external register file, the ALU has two
32-bit input buses (DA and DB) and one 32-bit
output bus (Y).
Internally, the device has a 32-bit data path
that interconnects its various functional
blocks. These blocks include various shifters
and multiplexers, a mask generator, a funnel 
shifter, the ALU proper, a priority encoder, a 
parity generator and checker, a master-slave 
comparator, and the status and Q registers 
(Fig. 2). The ALU proper has three 32-bit in- 
puts: R, S and M. The R input comes from the 
funnel shifter, the M input from the mask gen- 
erator, and the S input from a variety of sources 
—the DA or DB buses, status register, or the Q 
register. 

The power and flexibility of the Am29332 
comes partly from its ability to perform oper- 
ations on various data types. It can operate on 
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1. A conventional CPU, built with Am29300 building blocks, forms the focal point of an 
extremely compact system that cycles as fast as 80 ns. 
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variable bytes, variable-length bit fields, or sin- 
gle bits. This is made possible by the internal 
mask generator, which creates a 32-bit mask 
for each instruction (with no time overhead). 
The mask is used as an additional operand in 
each instruction to allow the operation on only 
selected data widths. 

The type of mask generated depends on the 
type of instruction. For instructions that oper- 
ate on variable bytes (1, 2, 3 or 4 bytes) the mask 
is a fence of 1s (bit 0 aligned) for all low-order 
selected bytes with a fence of 0s for all high- 
order unselected bytes. Instructions that oper- 
ate on variable-length bit fields require a mask 
that is a string of contiguous 1s for all selected 
bit positions and 0s for all unselected bit posi- 
tions. In cases where the field exceeds the 32-bit 
boundary, the mask does not wrap around, thus 
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allowing operation on a contiguous field across 
a word boundary. For instructions that operate 
on a single bit, the mask is a 1 for the selected bit
position and 0s for the other unselected bits. 
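
The three mask types described above can be modeled in a few lines. This is a sketch under stated assumptions, not the chip's actual logic: the function names are hypothetical, bit 0 is taken as the least-significant bit, and `position` is assumed to name the lowest bit of the field.

```python
WORD_BITS = 32
WORD_MASK = (1 << WORD_BITS) - 1


def byte_mask(num_bytes):
    """Fence of 1s over the low-order selected bytes (1 to 4), 0s above."""
    if not 1 <= num_bytes <= 4:
        raise ValueError("num_bytes must be 1..4")
    return (1 << (8 * num_bytes)) - 1


def field_mask(position, width):
    """Contiguous 1s for a bit field; bits past bit 31 are dropped, not wrapped."""
    if not (0 <= position < WORD_BITS and 1 <= width <= WORD_BITS):
        raise ValueError("position must be 0..31 and width 1..32")
    return (((1 << width) - 1) << position) & WORD_MASK


def bit_mask(position):
    """Single 1 at the selected bit position."""
    if not 0 <= position < WORD_BITS:
        raise ValueError("position must be 0..31")
    return 1 << position


print(hex(byte_mask(2)))        # low two bytes selected
print(hex(field_mask(28, 8)))   # field runs past bit 31 and is truncated
```

The truncation in `field_mask` mirrors the no-wrap behavior the text describes for fields that cross the 32-bit boundary.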

For most single-operand instructions, the 
unselected bit positions pass the corresponding 
bits of the operand unmodified. For most two-operand
instructions, the unselected bit positions
pass the corresponding bits of the operand on the
DB input unmodified. Thus, for two-operand
instructions the mask allows the
merging of two operands in a single cycle. In addition
to being used internally, the mask can be
sent out over the Y bus, permitting the gener- 
ator to be used as a pattern generator for test- 
ing purposes. 
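
The merging behavior for two-operand instructions can be illustrated as follows. This is a minimal sketch: `masked_merge` is a hypothetical helper, and the two-byte add is just one example of an operation whose result is merged with the DB operand.

```python
MASK32 = 0xFFFFFFFF


def masked_merge(result, db_operand, mask):
    """Take selected bits from the ALU result and unselected bits from DB."""
    return ((result & mask) | (db_operand & ~mask)) & MASK32


# Two-byte add: only the low 16 bits come from the sum,
# the upper 16 bits pass through from the DB operand.
da = 0x1234FFFF
db = 0xAAAA0001
raw_sum = (da + db) & MASK32
print(hex(masked_merge(raw_sum, db, 0x0000FFFF)))
```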

To speed various mathematical and logical 
operations, many circuits have started to in- 
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2. To connect its various internal functional blocks, the Am29332 ALU 
employs a 32-bit bus. Among the chip’s major features are a 64-bit fun- 
nel shifter, parity checking and generation, and a basic 32-bit ALU that 
has three input ports. The processor also has three 32-bit ports through 


which it transfers data into and out of the chip. 
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clude a barrel shifter, which has an N-bit input 
and an N-bit output. The barrel shifter would 
be used to shift or rotate the operand either up 
or down from 0 to N bits in a single cycle. Such 
high-speed shifting is very useful in operations 
such as the normalization of a mantissa for 
floating-point arithmetic or in applications in 
which the packing and unpacking of data are 
frequent operations. 

However, a more useful circuit is a funnel 
shifter, which can be thought of as having two 
N-bit inputs and one N-bit output. Just such a
circuit (with 32-bit-wide ports) was included on 
the 29332. The circuit can perform all the oper- 
ations of a barrel shifter with capabilities ex- 
tended to two operands instead of one. In addi- 
tion, it can extract a 32-bit contiguous field 
across its two operands, a function very useful 
in several graphics applications. And any of its 
operations can be followed by a logical oper- 
ation, with both completed in a single cycle. 
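
A funnel shifter can be modeled as a 32-bit window sliding over the 64-bit concatenation of its two inputs. The sketch below assumes that model; the function names and the choice of which input forms the high half are illustrative, not taken from the Am29332 documentation.

```python
MASK32 = 0xFFFFFFFF


def funnel_shift(high, low, shift):
    """Extract the 32-bit field that starts `shift` bits above bit 0 of high:low."""
    if not 0 <= shift <= 32:
        raise ValueError("shift must be 0..32")
    combined = ((high & MASK32) << 32) | (low & MASK32)
    return (combined >> shift) & MASK32


def rotate_right(value, count):
    """Barrel-shifter rotate: feed the same operand to both inputs."""
    return funnel_shift(value, value, count % 32)


def shift_right_logical(value, count):
    """Logical shift down: zeros enter from the high input."""
    return funnel_shift(0, value, count)


def shift_left_logical(value, count):
    """Logical shift up: the operand is the high input, zeros the low input."""
    return funnel_shift(value, 0, 32 - count)


print(hex(funnel_shift(0x12345678, 0x9ABCDEF0, 16)))  # field spanning both words
```

Feeding one operand to both inputs reproduces the single-operand barrel-shifter operations, which is why the text treats the funnel shifter as a superset.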


Setting the priorities 


Prioritization, useful for controlling N-way
branches, performing normalizations, and in
graphics operations such as polygon fills, can
readily be handled by the ALU chip. The built-in
priority encoder sends out a 5-bit binary
weighted code that signifies the relative posi- 
tion of the most-significant 1 from the most- 
significant bit position of the byte width se- 
lected. That allows prioritization on either 8-, 
16-, 24-, or 32-bit operands. The priority encoder 
output can be passed on to the Y bus or stored in 
the status register. 

If, for example, prioritization is used to nor- 
malize a mantissa during a floating-point 
arithmetic operation, it requires two cycles. In 
the first, the mantissa is prioritized to deter- 
mine the number of leading 0s that need to be 
stripped off. In the next cycle, the mantissa is 
shifted up by the amount specified by the prior- 
ity encoder output. 
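
The two-cycle normalization can be sketched as a prioritize step followed by a shift. The helper names are hypothetical, and the handling of an all-zero operand (returning `None`) is an assumption, since the text does not say what the encoder reports in that case.

```python
def priority_encode(value, num_bytes=4):
    """Distance of the most-significant 1 from the MSB of the selected width."""
    width = 8 * num_bytes
    value &= (1 << width) - 1
    if value == 0:
        return None  # assumed behavior for an all-zero operand
    return width - value.bit_length()


def normalize(mantissa, num_bytes=4):
    """Cycle 1: count leading 0s. Cycle 2: shift up by that amount."""
    width = 8 * num_bytes
    shift = priority_encode(mantissa, num_bytes)
    if shift is None:
        return 0, 0
    return (mantissa << shift) & ((1 << width) - 1), shift


normalized, shift = normalize(0x00012345)
print(hex(normalized), shift)
```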

Relevant information for each operation per- 
formed by the chip is stored in the 32-bit status 
register after each microcycle. Each byte of the 
status word holds different information. The 
least-significant byte holds the position spec- 
ifier. The next most-significant byte holds the 
width specifier and three other bits that are 
used to test the comparison of unsigned and 
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signed operands. The next byte contains the 
Carry, Negative, Overflow, Link, Zero, M and S
flags. The M flag stores the multiplier bit for
multiply or the sign compare bit for signed division,
and the S flag stores the sign of the partial
remainder for unsigned division. The most
significant byte stores the nibble carries for 
BCD operations. 

The states of the Carry, Negative, Overflow, 
Link and Zero flags are available on the status 
pins, and the status multiplexer allows the user 
to select either the status of the previous in- 
struction (register status) or the status of the 
current instruction (raw status) to appear on 
the status pins. The raw status could be used to 
update an external macro status register. This 
also allows branching at either the micro- or
macro-level. 

The Q shifter and Q register are primarily 
used to assemble the partial product or partial 
quotient in multiplication and division oper- 
ations. Variable bytes of the status and Q reg- 
ister can either be loaded via the DA and DB 
inputs or can be read over the Y bus. Thus sav- 
ing and restoring of the registers allows effi- 
cient interrupt handling after any microcycle. 
It is also possible to inhibit the update of both 
these registers by asserting the Hold pin. 


Powerful and orthogonal instructions 


The power of the ALU chip’s instruction set 
comes directly from the integration of several 
functional blocks mentioned earlier. The com- 
mands are symmetrical as well as orthogonal, 
to make it easier for a compiler to generate effi- 
cient code. Thus, any operation on the DA input 
is also possible on the DB input, and each in- 
struction is completely independent of its data 
type. 

Three-fourths of the instruction set consists 
of variable byte-width (one, two, three or four) 
operand instructions. The byte-width is se- 
lected by two bits in the instruction. For these 
operands, the instruction set supports all con- 
ventional arithmetic, logical and shift oper- 
ations. Arithmetic operations can be per- 
formed on both signed and unsigned binary 
integers. 

Additionally, the instruction set supports 
multiprecision arithmetic such as addition 
with carrying and subtraction with carrying or 
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borrowing. For all subtract operations it pro- 
vides the convenience of using borrowing in- 
stead of carrying by asserting the borrow pin. 
In this mode the carry flag is updated with the 
true Borrow. To allow efficient execution of 
macroinstructions the chip contains a Macro
mode pin. When this pin is asserted, the
external Macro-Carry and Macro-Link
bits, instead of their micro counterparts, participate
in the operation.

Instructions that execute algorithms for the
multiplication and division of signed and unsigned
integers in multiple cycles are also provided.
For multiplication, the circuit supports
the modified Booth algorithm, yielding two 
product bits in one cycle. Both single-precision 
and multiprecision division of signed and un- 
signed integers are supported at the rate of one 
quotient bit in every cycle. 
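
The two-bits-per-cycle rate of the modified Booth algorithm can be seen in a small software model. This is a textbook radix-4 Booth recoding for signed operands, offered as a sketch; it is not a description of the Am29332's internal multiply steps, and the function name is hypothetical.

```python
def booth_radix4_multiply(multiplicand, multiplier, bits=32):
    """Signed multiply that retires two multiplier bits per loop pass."""
    if bits % 2:
        raise ValueError("bits must be even")
    limit = 1 << (bits - 1)
    if not (-limit <= multiplicand < limit and -limit <= multiplier < limit):
        raise ValueError("operand out of range for the given width")
    y = multiplier & ((1 << bits) - 1)   # two's-complement bit pattern
    product = 0
    previous_bit = 0                     # implied bit to the right of bit 0
    cycles = 0
    for i in range(0, bits, 2):
        bit0 = (y >> i) & 1
        bit1 = (y >> (i + 1)) & 1
        digit = bit0 + previous_bit - 2 * bit1   # recoded digit in -2..+2
        product += (digit * multiplicand) << i
        previous_bit = bit1
        cycles += 1
    return product, cycles


print(booth_radix4_multiply(-7, 13))
```

For 32-bit operands the loop runs 16 times, consistent with two product bits per cycle.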

Besides binary integers the instruction set 
provides basic arithmetic operations for 
binary-coded decimal (BCD) numbers. By oper- 
ating directly on the decimal numbers created 


Device X 


3. To help ensure system integrity, two Am29332
processors can be set for master and slave operation.
Both chips perform the same operation in parallel,
and any difference in their results is flagged as
an error. The master also checks its internal result
against the data on the output bus to make sure 
that no other device (such as device X) is turned on 
at the same time. 
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in most business applications, significant pro- 
cessing time is saved by eliminating the need to 
convert from binary to BCD and vice versa. 
Also, the round-off errors involved in con- 
verting from one base to the other are elimi- 
nated. 

The last group of instructions was created to 
support variable-length bit fields (1 to 32) and 
single-bit operands. The position and width of 
the field can be specified by either the position 
and width inputs or by fields in the status reg- 
ister, thereby saving bits in the microcode. 
Most of the time, the position and width are 
determined dynamically. It is therefore diffi- 
cult to supply them via the microinstructions. 
For single bit operations only the position spec- 
ifier is needed. 

Bit-manipulation instructions include set- 
ting, resetting, or extracting a single bit of the 
operand or the status register. Logical oper- 
ations on either aligned or nonaligned fields in 
the two operands include OR, AND, NOT and 
XOR. In the case of nonaligned fields it is as- 
sumed that at least one of the fields is aligned to 
bit position 0. It is also possible to extract a field 
from one operand and insert it into another 
operand or extract a field across two operands. 


Enhancing system integrity 


The growing need for data integrity has been 
addressed at both the system and the chip level 
by including hardware for fault detection. Dur- 
ing calculations, byte-wide even parity is gener- 
ated for the data result by the ALU and stored 
with the data in the external RAM. Byte-wide 
even parity is also checked at the ALU inputs 
and any error is flagged. 
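
Byte-wide even parity can be sketched as below. The helper names are hypothetical, and the last example assumes that a floating bus reads as all 1s on both the data and parity lines, which is the failure case even parity is meant to catch.

```python
def even_parity_bit(byte):
    """Parity bit that makes the total count of 1s (data plus parity) even."""
    return bin(byte & 0xFF).count("1") & 1


def generate_parity(word):
    """Return four parity bits for a 32-bit word, least-significant byte first."""
    return [even_parity_bit(word >> (8 * i)) for i in range(4)]


def check_parity(word, parity_bits):
    """Return the indices of bytes whose stored parity bit does not match."""
    expected = generate_parity(word)
    return [i for i in range(4) if expected[i] != parity_bits[i]]


word = 0x01000003
print(check_parity(word, generate_parity(word)))   # no errors
print(check_parity(0xFFFFFFFF, [1, 1, 1, 1]))      # floating bus, every byte flagged
```

With even parity, a byte of eight 1s expects a parity bit of 0, so an all-1s bus with the parity line also high is reported as an error in every byte.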

Even parity is specifically used to check for a 
floating TTL bus. Thus, all interchip connections
are checked out. In addition, hardware for
functional verification is provided on the
sequencer and the ALU. Functional verification
can be implemented by using two similar devices
in the master and slave mode (Fig. 3). In
that setup, both chips perform the same oper- 
ation, with any difference in their outputs being 
flagged as an error. The slave-mode chip’s bidi- 
rectional buses operate in their input mode, al- 
lowing the master to compare its own internal 
result with that of the slave on every cycle. Ad- 
ditionally, the master checks the output bus to 
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make sure that no other device is turned on at 
the same time. 

As mentioned earlier, the ALU architecture 
was designed to use an external register file. 
Keeping the file external to the chip permits the 
user to expand it to meet any system need. The 
Am29334, a high-speed 64-word-by-18-bit dual- 
access RAM, provides two independent data in- 
put ports and two independent data output 
ports (Fig. 4). Each port can be read from or 
written to using the separate inputs and out- 
puts. The two accesses are independent except 
for the case when simultaneous write opera- 
tions are done to the same word—in which case 
the result is undefined. The read address inputs 
and the write address inputs of each side are se- 


[Figure 4 diagram: Am29334 dual-port RAM (64 x 18 bits)]


4. The dual-access RAM serves as an external reg- 
ister file for the arithmetic processor chip. The 
Am29334 holds 64 words, each 18 bits long. Two 
chips are often connected to build a RAM block with 
four data outputs, two data inputs, and six address 
lines. Each port of the RAM can be independently 
accessed to read or write. 
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parate in order to save the cost and time delay 
of external multiplexing between a read ad- 
dress and a write address. 

The word width of 18 bits allows the RAM to 
store two bytes plus a parity bit for each. Each 
side has separate write enable for the lower and 
upper nine-bit bytes and a common write en- 
able that also switches the address multiplexer. 
The actual write is delayed internally to allow 
the write address to set up internally before 
writing starts. 

It is possible to build a RAM with four data 
outputs, two data inputs and six addresses by 
using two dual-access RAMs and on each side 
connecting the data input, write address and 
write enables of one RAM in parallel with the 
corresponding inputs of the other RAM. This 
expanded RAM may be used in concurrent pro- 
cessing applications in which an ALU and an 
adder (which generates the address) do their 
computations—this yields a result and an ad- 
dress in parallel. The two values can then be fed 
simultaneously to the multiport memory. 


The sequencer controls the show 


The cycle time of the microprogrammed sys- 
tem is dependent on both the control path (i.e., 
sequencer and microprogram memory) and the 
data path (i.e., register file and ALU). Tradi- 
tionally, the system bottleneck has been the 
control path, especially the critical paths associated
with conditional branching. Special care
has been taken in the design of the Am29300 
family to balance control and data-path timing. 

A key device contributing to the improved 
control-path timing is the Am29331 16-bit mi- 
croprogram sequencer. It is designed for high 
speed, and that speed has been attained by the 
elimination of functions that would slow down 
the microaddress selection and by including the 
test logic and the test multiplexer in the sequencer
(Fig. 5). As in most previous-generation
sequencers, the address register, the incre- 
menter, the address multiplexer, the stack, and 
the counter are standard functions. The se- 
quencer has multiway branch instructions that 
allow 1 of 16 consecutive addresses to be se- 
lected as the branch target in a single cycle. 

The address register in most other sequen- 
cers is called a program counter, but this name 
is not correct if a strict definition is applied. In 
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the Am29331, the incrementing counter is 
placed after the address register, which thus al- 
lows for the handling of traps. The stack stores 
return addresses, loop addresses and loop 
counts. It has 33 levels to permit the deep nest- 
ing of subroutines, loops and interrupts. An 
output, Almost Full (A-Full), indicates when 28 
or more of the levels are in use. 

Available for use in iterative loops, the 
counter can be loaded with an iteration count at 
the beginning of a loop, and the count is tested 
and then decremented at the end of the loop. 




The loop is terminated if the count is equal to 
one; otherwise a jump to the beginning of the 
loop is executed. 
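
The counter's test-then-decrement behavior can be modeled as follows. This sketch only mirrors the loop control described above; `run_counted_loop` and its callback argument are hypothetical names, and a starting count below one is rejected because the text does not describe that case.

```python
def run_counted_loop(iteration_count, body):
    """Run `body` until the counter, tested at the end of each pass, equals one."""
    if iteration_count < 1:
        raise ValueError("iteration_count must be at least 1")
    counter = iteration_count        # loaded at the beginning of the loop
    passes = 0
    while True:
        body(passes)
        passes += 1
        if counter == 1:             # test first: terminate when count is one
            break
        counter -= 1                 # otherwise decrement and jump back
    return passes


visited = []
print(run_counted_loop(3, visited.append), visited)
```

Under this reading, loading the counter with N gives exactly N passes through the loop body.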

There are three buses that carry microad- 
dresses. The bidirectional D bus can be con- 
nected to the pipeline register, providing 
branch addresses or loop counts, or used for 
two-way communication with the data process- 
ing part of the system. The A bus, called an al- 
ternate bus, can be connected to a mapping 
PROM to provide starting microaddresses for 
instructions in a computer. The Y bus sends out 


[Figure 5 diagram labels: multiplexer, stack pointer, interrupt return address register]





5. To aid in handling trap operations, the incrementer is placed after the address 
register in the Am29331 microsequencer. Additionally, the chip has a 16-bit ad- 
dress bus, which enables it to access up to 64 kwords of control memory and han- 
dle interrupts and multiple-path branches. 
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selected microaddresses to the microprogram 
memory and accepts interrupt or trap address- 
es if interrupt or trap is employed. 

Four sets of 4-bit multiway inputs provide a
simultaneous test capability of up to 4 bits.
One way to use those inputs would be to
decode mode bits in changing positions in mac-
roinstructions. The four select lines select 1
of 16 tests to be used in conditional instructions.
There are twelve test inputs. Four of these may
be used for C (Carry), N (Negative), V (Over-
flow) and Z (Zero), generating internally the
tests C + Z, C + Z, N XOR V, and N XOR V + Z,
which are used for comparison of signed and
unsigned numbers.
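
To see why tests built from N, V and Z are enough for signed comparisons, the following Python sketch computes the flags for a 32-bit subtraction A - B and derives "less than" and "less than or equal" from them. The function names and the 32-bit width are assumptions chosen for the illustration, and the grouping (N XOR V) OR Z is the standard reading of the last test. The carry-based tests for unsigned numbers are left out because their polarity depends on the carry/borrow convention of the attached ALU.

```python
MASK = 0xFFFFFFFF


def to_signed(value):
    """Interpret a 32-bit pattern as a two's complement integer."""
    value &= MASK
    return value - (1 << 32) if value & 0x80000000 else value


def compare_flags(a, b):
    """Return (N, V, Z) for the 32-bit subtraction a - b."""
    a &= MASK
    b &= MASK
    result = (a - b) & MASK
    n = result >> 31                       # sign bit of the result
    z = int(result == 0)
    # Overflow on subtraction: the operands have different signs and the
    # sign of the result differs from the sign of a.
    v = ((a ^ b) & (a ^ result)) >> 31
    return n, v, z


def signed_less_than(a, b):
    n, v, _ = compare_flags(a, b)
    return bool(n ^ v)                     # N XOR V


def signed_less_or_equal(a, b):
    n, v, z = compare_flags(a, b)
    return bool((n ^ v) | z)               # (N XOR V) OR Z
```

The overflow term matters: comparing the largest positive value with the most negative one produces a result whose sign bit alone gives the wrong answer, while N XOR V is still correct.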

Relative addressing was the only somewhat 
useful function that was removed in order to 
maximize speed. The sequencer supports inter- 
rupts and traps with single-level pipelining, but 
may also be used with two levels of pipelining in 
the control path. It has a 16-bit-wide address 
path and cannot be cascaded, which thus limits 
the addressable memory depth to 64 kwords of 
microcode. That, however, is sufficient for the 
vast majority of applications—a typical 
computer, for instance, that has a micropro- 
grammed instruction set, might use only about 
1 to 2 kwords. However, for systems in which 
the microprogram is the sole program level, its 
size is generally larger. 


Microprogram interrupts supported 


The Am29331 sequencer supports interrupts 
at the microprogram level. Like polling, inter- 
rupts handle asynchronous events. However, 
polling requires explicit tests in the micro- 
program for events, thus leading to long re- 
sponse times, lower throughput, and larger mi- 
croprograms. Interrupts, on the other hand, 
have a response time equal to the cycle time of 
the system (approximately 80 ns), measured 
from the Interrupt Request input (INTR). The 
sequencer accepts interrupts at every micro- 
instruction boundary when the Interrupt En- 
able input (INTEN) is asserted. 

An actual interrupt turns off the Y bus driver 
and asserts the Interrupt Acknowledge output 
(INTA), which should be used to enable an ex- 
ternal interrupt address onto the Y bus, thus 
driving the microprogram memory. The inter-
rupt also causes the interrupt return address to
be saved on the stack; this permits nested inter-
rupts to be handled (Fig. 6).

The Am29331 is also the first sequencer that 
can handle traps. A trap is an unexpected situa- 
tion caused by the current microinstruction, 
which must be handled before the microin- 
struction completes and changes the state of 
the system. An attempt to read a word from 
memory across a word boundary in a single cy- 
cle is an example of such a situation. When a 
trap occurs, the current microinstruction must 
be aborted and re-executed after the execution 
of a trap routine, which will take corrective 
measures. 

Execution of a trap requires that the se- 
quencer ignore the current microinstruction 
and push the trap return address—the address 
of the ignored microinstruction—on the stack. 
The trap address must be transferred onto the 
Y bus at the same time. All this can be accom-
plished by disabling the carry-in to the incre-
menter (C_in) and asserting the Force Continue
input (FC) and the Interrupt Request input 
(INTR). 
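
The difference between an interrupt and a trap comes down to which address is saved on the stack. The toy Python model below tracks only the address register and the stack; the class and method names are hypothetical, and timing, the Y bus driver and the INTA handshake are deliberately left out.

```python
class SequencerModel:
    """Toy model of the address register and stack only (names are assumed)."""

    STACK_DEPTH = 33                     # stack depth quoted in the text
    ADDRESS_MASK = 0xFFFF                # 16-bit microaddress path

    def __init__(self, start=0):
        self.address = start & self.ADDRESS_MASK
        self.stack = []

    def _push(self, value):
        if len(self.stack) >= self.STACK_DEPTH:
            raise OverflowError("stack full")
        self.stack.append(value & self.ADDRESS_MASK)

    def _selected(self, branch_target):
        """Address the current microinstruction would send out next."""
        if branch_target is None:
            return (self.address + 1) & self.ADDRESS_MASK
        return branch_target & self.ADDRESS_MASK

    def step(self, branch_target=None):
        """Normal cycle: continue to address + 1 or take a branch."""
        self.address = self._selected(branch_target)
        return self.address

    def interrupt(self, vector, branch_target=None):
        """The current microinstruction completes; the address it selected
        is saved and the external vector drives the microprogram memory."""
        self._push(self._selected(branch_target))
        self.address = vector & self.ADDRESS_MASK
        return self.address

    def trap(self, vector):
        """The current microinstruction is ignored; its own address is saved
        (no increment) so that it can be re-executed after the trap routine."""
        self._push(self.address)
        self.address = vector & self.ADDRESS_MASK
        return self.address

    def return_from_routine(self):
        self.address = self.stack.pop()
        return self.address
```

After an interrupt taken at address A the routine returns to A + 1 (or to the branch target), whereas after a trap at A it returns to A itself.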

Also built into the sequencer is an address 
comparator, which allows detection of break-
points in the microprogram. An output signal
from the comparator indicates when the con- 
tent of the comparator register is equal to the 
address on the Y bus. There is an instruction 
that loads the comparator register from the D 
bus and enables the comparator, which may lat- 
er be disabled by another instruction. 

Parallel microprocesses are useful when the 
system must deal with peripheral devices that 
are controlled at the microcode level. Normally 
only one processor is present and it must be 
time multiplexed between the concurrent oper- 
ations that must be performed. When a process 
is suspended its private state must be saved, so 
that it can be restored when the process re- 
sumes execution. That, in turn, requires that 
the state of the sequencer be saved and re- 
stored, or each process must have its own 
sequencer that is active when the associated 
process is active. The first approach is the least 
expensive, but the second offers the advantage 
of shorter response time, because no time is 
spent on saving and restoring the state. 

The Am29331 supports the first approach 
with its bidirectional D bus, through which the
entire state, with the exception of the com-
parator register, can be saved and restored. The
sequencer also supports the multiple sequencer 
arrangement, in which the three-state Y buses 
from the sequencers are tied together driving a 
single microprogram memory. One of the se- 
quencers is active, while the remaining sequen- 
cers are put on hold by asserting their Hold 
inputs. The Hold input disables most outputs
(the D bus synchronously), disables the incre-
menter, and enables an internal Force Con-
tinue. This effectively detaches the sequencer
from the system and preserves its state.

The sequencer has a 6-bit instruction input 
that is internally decoded to yield a set of 64 in- 
structions. There are 16 basic branch instruc- 
tions, each in an unconditional version, a condi- 
tional version, and a conditional version with 
complemented test. In addition there are 16 
special instructions like Continue and Push C 
(push counter on stack). The branching instruc- 
tions handle jumps, subroutines, various kinds 
of loops and exits out of loops, and FC actually 
overrides the instruction inputs with acontinue 


Executing at B 


Interrupt return 
address register 


Multiplexer 


B+1 


6. Because it can accept interrupts at any micrainstruction boundary, the sequencer responds faster than 
most other microprogrammed systems. For example, while the instruction at point A in memory is being 
executed, the sequencer is directed to point B. The only restriction on the programmer is that the first in- 
struction of the interrupt routine cannot use the stack, since the interrupt return address is pushed onto it at 


the start of the procedure. 
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instruction. FC is useful in field sharing and 
support for writable microprogram memory. 
The Am29331 is one of the few sequencers 
where the stack is accessible from outside 
through the bidirectional D bus. This indirectly 
allows access to the whole state of the se- 
quencer except the comparator register. This is 
useful when testing the device, and during 


system debugging, in which, for example, the 
contents of the counter and the stack may be 
examined and altered. By including the trou- 


“pleshooting instructions in the microcode, the 


sequencer may aid in debugging itself and the 
rest of the system. The access to the state is also 
useful for changing context or extending the 
stack outside.O 


Single-chip accelerators 
speed floating-point 
and binary computations 


Complex multiplication or floating-point
mathematical operations are frequently
needed in most computer systems, but in
many cases not often enough to warrant the
added cost of dedicating CPU hardware to the
computational job. To speed up the calcu-
lations, though, many systems allow for accel-
erator boards or boxes that can perform such
operations at speeds of several megahertz or
more.

Already, many silicon designers have devel-
oped chips to simplify the design of such sub-
systems: 16-bit parallel multipliers fabricated
in bipolar, CMOS or NMOS processes, and
single-chip or multichip floating-point pro-
cessors made with CMOS or NMOS have been
available for some time. However, they are low-
performance solutions to the problem, or in
some cases, have limited application since they
are intended for highly pipelined systems.

Now, the ability to handle 32-bit binary mul-
tiplication or 32-bit floating-point multiplica-
tion, addition or subtraction can be added to a
system with just a single chip. The Am29323 is a
32-bit parallel multiplier that accepts two
32-bit inputs and can deliver a 64-bit product in
a single clock cycle of 80 ns. Alternatively, per-
forming floating-point operations, the
Am29325 accepts two 32-bit inputs and delivers
a 32-bit result in less than 125 ns. It can operate
with numbers represented in either the IEEE
(P754) or Digital Equipment Corp. floating-
point formats and can convert numbers from
one format into the other.

Both chips are part of the just unveiled 
Am29300 series of 32-bit computational ele- 
ments (Design Entry, p. 230). The multiplier is 
ideal for computer systems that do floating- 
point operations only infrequently but must of- 
ten perform high-speed integer calculations 
such as those required in image manipulation. 
The floating-point processor enhances systems 
used for fast Fourier transform and scientific 
calculations. Systems could even contain both
accelerators if a high-performance, general-
purpose system were built (Fig. 1).

To speed the flow of data into and out of the 
chips, both circuits were designed with two 
32-bit-wide input ports and one 32-bit output 
port. But the similarities end there, since the 
chips perform vastly different operations on 
the data. A fairly straightforward design, the 
multiplier uses a full Booth-encoded array to 
deliver a 64-bit product to the output register 
(Fig. 2). The output register feeds a multiplexer 
that sends the result, 32 bits at a time, to the 
output port. 

Double-precision operations can be done 
thanks to dual 32-bit input registers that are 
multiplexed into the multiplier array. A 67-bit 
partial-product adder allows new products to 
be summed with the contents of the output reg- 
ister. During this operation, the contents of the 
output register may be scaled by 32 bits, if nec- 
essary. Four partial products are formed and 
summed, and a temporary register assists in 
the scheduling of output transfers. The effec- 
tive pipelining throughput in the double- 
precision mode is one 64-bit multiplication 
every four cycles. The accumulator can also 
support 96- and 128-bit multiplications. How- 
ever, for such operations, input data must be 
repeatedly applied. 
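
The double-precision scheme can be illustrated with a short Python sketch that builds a 64-by-64-bit product from four 32-by-32-bit partial products, shifting the accumulator right by 32 bits between groups and emitting the result one 32-bit word at a time. It assumes unsigned operands; the order in which the partial products are formed and the function name multiply_64x64 are assumptions of the example, not a description of the chip's actual control sequence.

```python
MASK32 = (1 << 32) - 1


def multiply_64x64(a, b):
    """Return the 128-bit product of two unsigned 64-bit values as four
    32-bit words, least significant word first."""
    if not (0 <= a < 1 << 64 and 0 <= b < 1 << 64):
        raise ValueError("operands must be unsigned 64-bit values")
    a_lo, a_hi = a & MASK32, a >> 32
    b_lo, b_hi = b & MASK32, b >> 32

    words = []
    acc = a_lo * b_lo                 # partial product 1
    words.append(acc & MASK32)        # output bits 31..0
    acc >>= 32                        # scale the accumulator by 32 bits

    acc += a_lo * b_hi                # partial product 2
    acc += a_hi * b_lo                # partial product 3
    assert acc < (1 << 67)            # running sum fits a 67-bit adder
    words.append(acc & MASK32)        # output bits 63..32
    acc >>= 32

    acc += a_hi * b_hi                # partial product 4
    words.append(acc & MASK32)        # output bits 95..64
    words.append(acc >> 32)           # output bits 127..96
    return words


def assemble(words):
    """Rebuild an integer from 32-bit words, least significant first."""
    return sum(word << (32 * index) for index, word in enumerate(words))
```

In this ordering the running sum never needs more than 67 bits, which is consistent with the adder width quoted above.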

The input and output registers of the multi- 
plier have independent control signals so that 
they can be optimally timed in pipelined 
systems. However, in unpipelined systems, the 
registers can independently be made “trans- 
parent” so that data encounters no delays when 
entering or leaving the chip. Like the other 
chips in the Am29300 family, the multiplier has
parity checking and generating circuits to en- 
sure system data integrity. And, the circuit of- 
fers a slave mode in addition to its normal 
mode—if two chips are tied together to operate 
in parallel with one set to operate in the slave 
mode, the circuits will generate an error flag if 
unequal results are obtained. 

In the world of floating-point computations,
several single-chip units, designed to be gen-
eral-purpose math coprocessors for micro-
processor systems, have achieved close to micro-
second operating speeds. However, to achieve
higher throughput rates, several recently an-
nounced two-chip sets have cut that time by a
factor of 10, achieving data throughput rates of
10 MHz for pipelined operations. But, if oper-
ated in nonpipelined systems, these chips lose
considerable speed, often by a factor of two or
three, since data must ripple through the
stages of pipeline registers.

To cut the data delays, the Am29325 took a di-
rect approach and eliminated all the pipelining.
It is the first floating-point processor to contain
a 32-bit floating-point adder/subtractor, mul-
tiplier, and flexible 32-bit wide data path on a
single chip (Fig. 3). Additionally, support for di-
vision operations is included on the chip as well
as a status flag generator.

Fabricated with the IMOX-S bipolar process
and three levels of metal interconnections and
housed in a 144-lead pin-grid-array package,
the Am29325 can replace one to two boards of
SSI and MSI logic typically used in general-
purpose computers, array processors and
graphics engines, to provide high-speed float-
ing-point math capability. When used in con-
cert, the on-chip functions will meet the com-
putational and data-routing needs of these and
many other applications.

[Figure 1 blocks: 4-port register file (two Am29334s), floating-point processor (Am29325), 32-by-32-bit multiplier (Am29323), ALU (Am29332)]

1. The 32-bit multiplier and the 32-bit floating-point
processor can be used together in a system. Either
chip also functions without the other if just one of
the capabilities is needed.

[Figure 2 blocks: parity error, XA register, XB register, 32 × 32-bit multiplier array, multiplexer, parity, hard error]

2. Surrounding the 32-by-32-bit multiplier array on
the Am29323 are multiplexers for the two 32-bit input
buses, which permit 64-bit multiplications to be
done in just four cycles. The multiplier checks parity
on the input data and generates parity bits for the
output result.

Integrating these functions into a single de- 
vice greatly reduces data routing problems and 
minimizes processing overhead that would 
otherwise be incurred when shuffling data on 
and off the chip. The internal data path is
ideally suited for multiplication and accumu- 
lation, Newton-Raphson division, polynomial 
evaluation, and other often-used arithmetic 
sequences. Placing the data path on chip also 
dramatically reduces the number of ICs needed 
to interface the device to the rest of the system. 

The three-port floating-point arithmetic 
unit at the chip’s core can perform any of eight 
instructions in a single clock cycle. The absence 
of pipeline delay in the arithmetic unit means 
that the result of an operation is available for 
use as an input operand in the very next oper- 
ation, a crucial feature when performing algo- 
rithms with tight feedback loops. Instructions 
and other operating modes are selected with 
dedicated input signals, an approach ideally 
suited to microprogrammed environments. The 
device easily interfaces with a variety of 16- and 
32-bit systems using one of three program- 
mable bus modes. 


Delving into the operation 


At the heart of the arithmetic unit are a high-
speed adder-subtracter, a 24-by-24-bit multi-
plier, an exponent processor, and other logic
needed to implement the floating-point
operations. Two input ports, R and S, provide
operands for the instruction to be performed;
the result appears on port F. One of eight in-
structions is selected by placing a 3-bit code on
lines I0, I1, and I2. The first three instructions,
R + S, R - S, and R × S, operate on both input
operands; the remaining instructions need only
one input operand.

The fourth instruction, 2 - S, forms the core
of the Newton-Raphson division algorithm, in 
which the quotient A/B is calculated by first 
evaluating 1/B, then postmultiplying by A. The 
reciprocal value 1/B is derived by using an ex- 
ternal lookup table to provide an approxima- 
tion of 1/B; this approximation is refined using 
the iterative equation: 


x_n = x_{n-1} (2 - B x_{n-1}),
where x_n is the nth approximation of 1/B.
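
A minimal Python sketch of this Newton-Raphson refinement is shown below. Each step is rounded to single precision with the standard struct module to mimic a 32-bit data path; the seed argument stands in for the external lookup table, and the function names and iteration counts are assumptions of the example rather than anything specified for the chip.

```python
import struct


def f32(x):
    """Round a Python float to the nearest single-precision value."""
    return struct.unpack("<f", struct.pack("<f", x))[0]


def reciprocal(b, seed, iterations=3):
    """Refine an approximation of 1/b using only multiply and (2 - S) steps."""
    x = f32(seed)
    for _ in range(iterations):
        t = f32(b * x)        # R × S: form B * x
        t = f32(2.0 - t)      # 2 - S
        x = f32(x * t)        # R × S: next approximation
    return x


def divide(a, b, seed, iterations=3):
    """Quotient a/b obtained by postmultiplying the reciprocal by a."""
    return f32(a * reciprocal(b, seed, iterations))


print(divide(1.0, 3.0, seed=0.3))
```

The relative error is roughly squared on every pass, so the better the seed from the lookup table, the fewer iterations are needed to reach single-precision accuracy.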

Once B and the approximation of 1/B are
loaded into the Am29325, the approximation is
refined using a sequence of R × S and 2 - S in-
structions; no additional I/O operations are
needed for reciprocal refinement. The remain-
ing four instructions perform data format con-
versions. Instruction INT to FP converts a
32-bit, two’s complement integer to floating-
point form, useful when processing data initial-
ly generated in fixed-point format; conversion
from floating point to integer format is handled
by instruction FP to INT. Two other instruc-
tions convert between IEEE and DEC floating-
point formats.

3. Also using separate 32-bit buses for the inputs
and output, the Am29325 floating-point processor
handles either IEEE or DEC formatted data and can
translate between formats, if necessary.

The arithmetic unit recognizes two single- 
precision floating-point formats—the IEEE 
format as specified in proposed standard P754, 
draft 10.0, or the DEC format used in VAX 
minicomputers. The eight instructions can be 
performed using either format; the desired 
format is selected with the IEEE/DEC pin on 
the processor chip. The formats are broadly 
similar—each has an 8-bit biased exponent, a 
24-bit significand comprising a 23-bit mantissa 
appended to an implied or “hidden” most- 
significant bit (MSB), and a sign bit. 

There are, however, a number of subtle dif- 
ferences. The IEEE format has an exponent 
bias of 127 and a binary point placed to the right 
of the hidden bit, while the DEC format has an 
exponent bias of 128 and a binary point placed 
to the left of the hidden bit—these variances re- 
sult in a slightly different range of represent- 
able values. Each format has its own set of 
operands reserved for special uses. The IEEE 
format reserves operands to represent non- 
numerical values (referred to as Not a Number, 
or NaN), +∞, -∞, and plus and minus 0; the
DEC format reserves only two types of oper- 
ands to represent non-numerical values and 0. 
In addition to format differences, there are a 
number of minor differences in the manner 
in which operands are handled during the 
course of a calculation. These differences are 
automatically accounted for when the desired 
format is selected. 
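
The effect of the different bias and binary-point position can be made concrete with a small Python decoder. It assumes the 32-bit word is viewed as sign (bit 31), exponent (bits 30 to 23) and fraction (bits 22 to 0) for both formats, which ignores the word ordering DEC machines use in memory, and it rejects the reserved exponent codes instead of interpreting them. The function names are hypothetical.

```python
def decode_fields(word):
    """Split a 32-bit word into (sign, biased exponent, 23-bit fraction)."""
    sign = (word >> 31) & 1
    exponent = (word >> 23) & 0xFF
    fraction = word & 0x7FFFFF
    return sign, exponent, fraction


def ieee_value(word):
    """Value of a normalized IEEE single: (-1)^s * 1.f * 2^(e - 127)."""
    sign, exponent, fraction = decode_fields(word)
    if exponent in (0, 255):
        raise ValueError("zero and reserved operands are not handled here")
    significand = 1.0 + fraction / 2**23       # binary point right of hidden bit
    return (-1) ** sign * significand * 2.0 ** (exponent - 127)


def dec_value(word):
    """Value of a DEC single: (-1)^s * 0.1f * 2^(e - 128)."""
    sign, exponent, fraction = decode_fields(word)
    if exponent == 0:
        raise ValueError("zero and reserved operands are not handled here")
    significand = 0.5 + fraction / 2**24       # binary point left of hidden bit
    return (-1) ** sign * significand * 2.0 ** (exponent - 128)


print(ieee_value(0x3F800000), dec_value(0x3F800000))   # 1.0 0.25
```

The same bit pattern is therefore four times smaller when read as a DEC value, which is why the representable ranges differ slightly.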


The need for rounding 


When performing a floating-point operation,
it is sometimes possible to generate a result
whose value cannot be precisely expressed as a
floating-point number. If, for example, the
single-precision floating-point values 2^24 and
2^-1 are added, the infinitely precise result, 2^24
+ 2^-1, cannot be represented exactly in the
single-precision floating-point format. Some
means, then, must be provided for mapping the
infinitely precise result of a calculation to a re-
presentable floating point value. The arith-
metic unit implements four IEEE-mandated
rounding modes to afford the user some flex-
ibility when performing this mapping; the de-
sired rounding mode is selected with signals
RND0-RND1.

Of the four modes, the round-to-even mode is
most often used; it maps the infinitely precise
result of an operation to the closest representa-
ble floating-point value. The round-toward
-∞ mode maps to the nearest representable
value less than or equal to the infinitely precise
result; similarly, the round-toward +∞ mode maps
to the nearest value greater than or equal to the
infinitely precise result. A fourth mode, round
toward zero, maps to the closest representation
whose magnitude is less than or equal to that of
the infinitely precise result. As one would ex-
pect, if the infinitely precise result of an oper-
ation is representable in the floating-point
format, it passes through the rounding oper-
ation unchanged, regardless of rounding mode.
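
The four mappings can be tried out with exact rational arithmetic. The Python sketch below rounds an exact value to a 24-bit significand under each mode; it ignores the exponent range (no overflow or underflow handling), and the function name and mode strings are assumptions made for the illustration.

```python
from fractions import Fraction
import math


def round_to_24_bits(value, mode):
    """Round an exact value to a 24-bit significand under the given mode."""
    x = Fraction(value)
    if x == 0:
        return x
    magnitude = abs(x)
    # Find e with 2**e <= |x| < 2**(e + 1).
    e = magnitude.numerator.bit_length() - magnitude.denominator.bit_length()
    if Fraction(2) ** e > magnitude:
        e -= 1
    ulp = Fraction(2) ** (e - 23)          # spacing of 24-bit values near x
    q = x / ulp
    low, high = math.floor(q), math.ceil(q)
    if mode == "toward_minus_infinity":
        n = low
    elif mode == "toward_plus_infinity":
        n = high
    elif mode == "toward_zero":
        n = low if x > 0 else high
    elif mode == "nearest_even":
        if q - low < high - q:
            n = low
        elif q - low > high - q:
            n = high
        else:                              # exact tie (or already exact)
            n = low if low % 2 == 0 else high
    else:
        raise ValueError("unknown rounding mode")
    return n * ulp


exact = Fraction(2) ** 24 + Fraction(1, 2)   # needs more than 24 significant bits
for mode in ("nearest_even", "toward_minus_infinity",
             "toward_plus_infinity", "toward_zero"):
    print(mode, round_to_24_bits(exact, mode))
```

For this input only the round-toward-plus-infinity mode moves up to 2^24 + 2; the other three modes return 2^24.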

As the result of an operation, various status 
flags are set or reset by the status flag gener- 
ator. Six flags are used to note the occurrence of 
overflow, underflow, zero, not-a-number, 
invalid, or inexact conditions. Because the flags 
are generated as the operation is performed, 
the user can greatly reduce processing over- 
head that would otherwise be needed to test the 
results of operations. The flags are fully de- 
coded, minimizing the amount of hardware 
needed to interpret them. 


Flagging the status 


Four of the status flags report exception con- 
ditions stipulated in IEEE standard P754. The 
Invalid flag indicates that an input operand or 
operands are invalid for the operation to be per- 
formed. The Underflow and Overflow flags are 
active when a result is too small or too large for 
the operation’s destination format. The fourth 
exception flag, Inexact, tells the user that the 
result of an operation is not infinitely precise. 
Although these flags are primarily an adjunct 
to operation in the IEEE format, they also pro- 
duce valid results when the DEC format is se- 
lected. The Am29325 generates two additional 
flags not provided for in the IEEE standard. 
Flags Zero and NaN identify zero-valued or 
nonnumerical results for both IEEE and DEC 
formats. 

A floating-point processor whose arithmetic
unit performs millions of operations per second
can maintain that operating speed only if the
correct operands can be routed to the arith-
metic unit at that rate; if not, the specification
is meaningless. To meet this crucial require-
ment, the core of the Am29325 is supported by a
32-bit data path comprising two input buses, a 
three-state output bus, and two data feedback 
paths. These data paths give the user the means 
to get the operands to where they are needed 
without devouring extra clock cycles. 

Data enters through input buses R0-R31 and
S0-S31; results exit through three-state output
bus F0-F31. Each bus has a 32-bit edge-triggered
register for data storage; data is stored on the
rising edge of common clock input, CLK. An in-
dependent clock enable is provided for each reg-
ister, so that new data can be clocked in or old
data held; the clock enables are well-suited to a
microprogrammed environment, and make the
gating of clocks, always a risky business, un-
necessary. The ability to clock or hold any
register is a powerful tool for performing algo-
rithms with conditional operations, or algo-
rithms in which intermediate results must be
delayed for one or more cycles before reenter-
ing the calculation.

In many applications, the internal registers
will be used to store input and output operands;
it is in this register-to-register mode that the
chip shows its top speed. Some users, however,
may wish to bypass one or more of the internal
registers. The input and output registers can be
made transparent independently using feed-
through controls FT0 and FT1. If all three reg-
isters are made transparent the device operates
in a purely combinatorial “flow-through”
mode. That mode, though, is somewhat slower
than the register-to-register mode, but is useful
in systems that need a register structure sub-
stantially different from that provided in the
Am29325, or in systems where floating point
operations must be concatenated with other
combinatorial functions.

The two feedback data paths greatly simplify
the task of moving data from one calculation to
the next. One path routes data from the output
of the arithmetic unit to a multiplexer at the in-
put of register R; the multiplexer selects the
operation result or R0-R31. The result of any
operation can therefore be loaded into register
R, register F, or both. The second path feeds the
output of register F to a multiplexer at the
arithmetic unit’s S port; the multiplexer selects
either register S or register F as the port S in-
put. This path effectively increases the number
of commands: instruction R Plus S, for ex-
ample, can also be performed as R Plus F.

4. Three programmable I/O bus modes permit the
floating-point processor to operate with dual 32-bit
input buses (a), a single, shared 32-bit input bus (b),
or even two 16-bit buses (c) so that it can easily
connect to most 16-bit microprocessor systems.

Thanks to the inclusion of three program- 
mable I/O modes, the circuit readily interfaces 
with both 16- and 32-bit systems. The most
straightforward of these options is the 32-bit, 
two-input bus mode (Fig. 4a). The advantage of 
this mode is its high I/O bandwidth—no multi- 
plexing of I/O buses is required, thus improving 
system speed and easing critical timing con- 
straints. R and S operands are taken from their 
respective buses and clocked into the R and S
registers on the rising edge of CLK; register F is 
also clocked on this transition. 

Another choice sets up a 32-bit, single-input 
bus, in which both the R and S buses are con- 
nected to a single input bus (Fig. 4b). The R and 
S operands are multiplexed onto this bus by the 
host system; the R register clocks its operand on
the rising edge of CLK, the S register on the 
falling edge. The S operand is double-buffered 
on chip, so that the new S operand is presented 
to the arithmetic unit on the rising edge of CLK. 
Operation of register F and the F bus is the 
same as in the 32-bit, two-input bus mode. 

The last option has targeted 16-bit systems— 
a 16-bit, two-input bus mode (Fig. 4c). In this 
mode the R, S, and F buses are 16 bits wide; 
32-bit operands are placed on the buses by time- 
multiplexing the 16 MSBs and LSBs of each 
data word. The LSBs of the R and S operands 
are double-buffered on chip, so that the com- 
plete 32-bit operands are presented to the arith- 
metic unit on the rising edge of CLK. Internal 
data paths and registers remain 32 bits wide,
thus giving the 16-bit system designer the be- 
nefits of the simple interface and the speed of 
the wide internal data paths. 
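
How a 32-bit operand travels over a 16-bit port can be pictured with the Python sketch below. It assumes that the 16 LSBs are sent first and held in an on-chip latch, and that the 16 MSBs arrive at the rising clock edge that presents the full word to the arithmetic unit; that ordering is inferred from the double-buffering of the LSBs described above, and the class and method names are hypothetical.

```python
def split_operand(word32):
    """Split a 32-bit word into (16 MSBs, 16 LSBs) for a 16-bit bus."""
    return (word32 >> 16) & 0xFFFF, word32 & 0xFFFF


class HalfWordPort:
    """One 16-bit input port with a holding latch for the LSBs."""

    def __init__(self):
        self._lsb_latch = 0
        self.register = 0

    def load_lsbs(self, half):
        """First transfer: the 16 LSBs are captured in the holding latch."""
        self._lsb_latch = half & 0xFFFF

    def clock(self, msb_half):
        """Rising edge of CLK: the MSBs on the bus join the latched LSBs."""
        self.register = ((msb_half & 0xFFFF) << 16) | self._lsb_latch
        return self.register


port = HalfWordPort()
msbs, lsbs = split_operand(0x3F8000AB)
port.load_lsbs(lsbs)
print(hex(port.clock(msbs)))         # 0x3f8000ab
```

Because the two halves are recombined before they reach the register, the arithmetic unit always sees complete 32-bit operands.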


Putting the part through its paces 


Multiplication and accumulation—a combi- 
nation of operations very commonly used in 
digital filtering, image processing, matrix 
manipulation, and many other applications— 
can readily show the capability of the floating- 
point processor. In such a combination of 
operations, N input terms x_i are multiplied by
constants k_i; the products are then added, pro-
ducing the weighted sum:

S = \sum_{i=0}^{N-1} k_i x_i

To do this with the Am29325 is a simple two-
step process, with two additional steps for ini-
tialization. In the first step data and coefficient
values x_0 and k_0 are clocked into registers R and
S. During step two the values x_0 and k_0 are mul-
tiplied and the product placed in register F; at
the same time, data and coefficient values x_1
and k_1 are clocked into R and S. Third, values x_1
and k_1 are multiplied and the product placed in
R. In step four, products x_1 k_1 and x_0 k_0 are added
and the sum placed in F, and x_2 and k_2 are clock-
ed into R and S.

The third and fourth steps are then repeated 
for as many iterations as needed to complete 
the operation. Once the part has been loaded 
with the first two sets of operands, the internal 
data path routes partial results to keep the 
arithmetic unit busy with a multiplication or 
addition every clock cycle; a new multiplication 
and accumulation is performed every two clock 
cycles. The partial results remain on-chip until 
the multiplication and accumulation is com- 
pleted, thus eliminating I/O delays and the 
more complex programming that would result 
from having the adder and multiplier on sep- 
arate chips. 
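
The register schedule for multiplication and accumulation can be traced with a small Python simulation. The variables R, S and F stand for the chip's registers; the cycle bookkeeping and the function name are assumptions of the sketch, and ordinary Python floats are used in place of single-precision hardware arithmetic.

```python
def multiply_accumulate(xs, ks):
    """Return (sum of k_i * x_i, cycles used) following the R/S/F schedule."""
    pairs = iter(zip(xs, ks))
    first = next(pairs, None)
    if first is None:
        raise ValueError("at least one term is required")

    R, S = first                       # step 1: clock x_0, k_0 into R and S
    cycles = 1
    F = R * S                          # step 2: product into F ...
    pending = next(pairs, None)        # ... while the next pair is clocked in
    cycles += 1

    while pending is not None:
        R, S = pending                 # operands loaded during the previous cycle
        R = R * S                      # step 3: product fed back into R
        cycles += 1
        F = R + F                      # step 4: R Plus F, sum into F ...
        pending = next(pairs, None)    # ... while the next pair is clocked in
        cycles += 1
    return F, cycles


print(multiply_accumulate([1.0, 2.0, 3.0], [4.0, 5.0, 6.0]))   # (32.0, 6)
```

For N terms the loop uses 2N cycles, matching the rate of one multiplication and accumulation every two clock cycles.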


Some real applications 


A more specific application for the Am29325 
could be its use as the computational engine in a
fast Fourier transform (FFT) processor. Dur-
ing an FFT operation, word growth is incurred in
the butterfly calculation, and if the FFT pro- 
cessor uses integer arithmetic, word growth 
can cause a system overflow. To prevent over- 
flow, a scaling operation must be performed on 
the data. The overhead involved in checking for 
word growth overflow and scaling of data can 
be avoided by using floating-point arithmetic. 
Floating-point provides not only greater dy- 
namic range but in most cases also provides 
greater precision (24 bits of significance versus 
16 bits in a typical integer system). 

A powerful, low-cost system that executes 
FFTs can be built around the floating-point 
processor (Fig. 5). It consists of a floating-point 
arithmetic processing unit, a data and coeffi- 
cient address generator, a data and address 
storage block, high-speed data and coefficient 
memories, a system controller, clock generator, 
and host interface. Input operands to the R port 
are fed from the data store, while data to the S 
port is fed from the coefficient memory. The re-
sult of an arithmetic operation may be stored
back in the data memory. An exclusive-OR gate
is also available to complement the sign of the
result, effectively multiplying the operand on
the F bus by -1. For most operations, inter-
mediate results can be held within temporary
registers in the floating-point unit; only the fi-
nal result need be sent off chip.

The high-speed data memory is made up of 
RAMs, the coefficient memory of PROMs. The 
data memory can be loaded with data from the 
host or can store results that have been pro- 
cessed through the floating-point chip. Once all 
data or results have been stored, the data 
memory is ready for use in an operation, or for 
transfer back to the host system. The coeffi-
cient PROMs contain the sine and cosine data
required for an FFT, while the data store holds 
frequently used operands. 

During the calculation of a butterfly, the 
same operands must be used in several differ- 
ent cycles—and since the data store reduces the 
number of memory read operations required, it 
speeds up data access. As the butterfly se- 
quence progresses, the appropriate address is 
available from the address store, which con- 
sists of two more multilevel pipelined registers. 

The host interface consists of a DMA channel 
that can perform high-speed block data trans- 
fers between the host system and the data 
memory. The system controller communicates 
with the host to receive or transfer data. It gov- 
erns which operations are to be performed and 
how to perform them. Instructions are issued 
by the host computer, via the host interface, to 
the system controller, and the system control- 
ler informs the host when the operation is done. 

The system controller consists of an 
Am29331 or similar microsequencer, and a mi- 
crocode program stored in registered PROMs. 
The system clock generator uses an Am2925. 
The architecture allows a ten-cycle butterfly 
FFT to be executed (see Fig. 5 again) using a 
radix-2 decimation-in-time (DIT) algorithm. 
The equations for a radix-2 DIT algorithm are: 

A’ = A + BW
B’ = A - BW, where all values are
complex
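
Written out in real arithmetic, one radix-2 DIT butterfly looks like the Python sketch below. It uses four real multiplications and six real additions or subtractions, ten arithmetic operations in all; that count is consistent with a ten-cycle butterfly on a unit that performs one operation per cycle, although the exact cycle-by-cycle schedule of the hardware is not reproduced here. The function name is hypothetical.

```python
def butterfly(a, b, w):
    """Radix-2 decimation-in-time butterfly: returns (A + BW, A - BW)."""
    ar, ai = a.real, a.imag
    br, bi = b.real, b.imag
    wr, wi = w.real, w.imag

    # Complex product BW: four real multiplications ...
    p1 = br * wr
    p2 = bi * wi
    p3 = br * wi
    p4 = bi * wr
    # ... combined with one subtraction and one addition.
    bw_real = p1 - p2
    bw_imag = p3 + p4

    # Four more additions/subtractions form the two outputs.
    a_new = complex(ar + bw_real, ai + bw_imag)
    b_new = complex(ar - bw_real, ai - bw_imag)
    return a_new, b_new


print(butterfly(1 + 2j, 3 + 4j, 1j))   # ((-3+5j), (5-1j))
```

The result can be checked against Python's built-in complex arithmetic, since a + b * w and a - b * w give the same two values.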


In cycles 1, 2, and 3, the first three operands
are read from the data memory. Because of the
overlapping butterflies, this read takes place
while the previous butterfly is still being
processed. In the following two cycles, data
writes of the previous butterfly occur while the
complex multiplications of (BW) are being
performed. Cycle 6 reads in a new operand for the
present butterfly and sums together the two
products from the two previous cycles. In cycles
7 and 8, the real parts of A’ and B’ are formed. In
cycles 9 and 10, the real parts of A’ and B’ are
written to memory. Also, during these two cycles
the other product pairs of (BW) are formed.
During cycles 11, 12, and 13, data for the next
butterfly is read, and as part of cycles 12 and 13,
the imaginary parts of A’ and B’ are formed. In the
following cycles the imaginary parts of A’ and B’
are written to memory and processing of the next
butterfly is initiated. The real and imaginary
components of B’ have a negative sign, which can
be corrected by complementing the sign. Counting
the number of cycles from the first read or
write of one butterfly to the next, it can be seen
that a butterfly is computed every 10 cycles.

Fig. 5. To build a fast-Fourier transform processor
that uses the floating-point processor as its heart
requires only a few control chips and some memories.
Use of the Am29540 and Am29332 LSI building
blocks helps keep the circuitry simple. (Blocks
labeled in the figure include the host interface, a
generator built from the Am29540, the coefficient
memory, and the microcode controller.)
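
The arithmetic of a single butterfly can be sketched in a few lines of Python. This is only an illustration of the radix-2 DIT equations given earlier, not of the ten-cycle hardware schedule; the function name and the expansion of the complex product into four real products are assumptions made for the example.

```python
def butterfly(a: complex, b: complex, w: complex) -> tuple[complex, complex]:
    """Radix-2 DIT butterfly: A' = A + B*W and B' = A - B*W."""
    # A complex multiply needs four real products, formed here as two pairs.
    bw_re = b.real * w.real - b.imag * w.imag
    bw_im = b.real * w.imag + b.imag * w.real
    # Real and imaginary parts of the outputs are formed separately,
    # mirroring the split between real-part and imaginary-part cycles.
    a_out = complex(a.real + bw_re, a.imag + bw_im)
    b_out = complex(a.real - bw_re, a.imag - bw_im)
    return a_out, b_out


if __name__ == "__main__":
    print(butterfly(1 + 2j, 3 + 4j, -1j))
```

Each call corresponds to one butterfly; a complete FFT would apply it across all stages with the appropriate coefficients W.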


The big system picture 


Although the floating-point chip fits well in 
small systems, it is also easily incorporated in 
larger, more powerful configurations. In one 
such system, a high-speed, microprogrammed 
integer and floating-point processor can be 
readily tailored to implement signal process- 
ing, image processing, or graphics algorithms 
(Fig. 6). The processor consists of a two-level 
controller, data and coefficient memory, ad- 
dress generator, and arithmetic unit. These 
functional blocks are considerably more flexible
than their counterparts in the simpler FFT
system.

The controller is divided into two levels, or 
sections: program and microprogram. In the 
topmost or program section, an Am2910A mi- 
croprogram controller addresses a program 
memory that contains high-level instructions, 
or macros. These macros implement build- 
ing-block operations; a graphics processor, for 
example, might have macros called Translate 
and Rotate that move objects in three-dimen- 
sional space. Each macro would carry with it 
parameters relevant to its operation, such as 
memory pointers or iteration count. 

The program section passes address-related 
parameters to the address generator, and 
passes the iteration count and the decoded mi- 
croinstruction start address to the micro- 
program section of the controller; this section 
then provides cycle-by-cycle control of processor
resources during the execution of a macro. The
heart of the microprogram section is an Am29331
microprogram controller, which addresses a
microcode memory in which the microprogram
sequence for each macro type is stored.

The microprogram controller was chosen for
three reasons. First, it can address up to 64
kwords, which makes possible a deep microprogram
memory that can store many operation sequences.
Second, its high speed permits the use of slower,
less expensive microprogram memory, a
particularly important consideration when the
microprogram is large. Third, its micro-interrupt
feature can be used to efficiently implement
exception handling for arithmetic operations. By
using interrupts for these exceptions, the
overhead otherwise incurred in testing status
flags can be greatly reduced.

Fig. 6. A versatile yet high-performance
microprogrammable system can be built by including
both the floating-point processor and the 32-bit
multiplier in a system that uses the other Am29300
building blocks to form the control and address
generation sections. (The figure shows separate
address and data buses.)

The data and coefficient memories store in- 
put data, output data, and constants. In this ap- 
plication, data and coefficient memory have 
been separated from program memory. Some- 
times referred to as a Harvard architecture, 
this approach increases throughput by allow- 
ing instruction fetch and operand fetch oper- 
ations to proceed in parallel. 

The address generator comprises an Am29332
ALU and two Am29334 register files. The register
file stores up to sixty-four 32-bit base addresses
and pointers. The Am29332 creates a 32-bit
effective address from these bases and pointers,
with the calculation taking one of the forms:


    base + pointer
    base - pointer
    base
    pointer


In addition, the Am29332 can perform mask,
shift, and merge operations in a single cycle.
This feature can be used to quickly calculate
matrix addresses of the form:


    a × 2^n + b,


where a and b are the row and column indices of
the matrix element to be accessed. The combination
of a 32-bit effective address and efficient
matrix addressing makes this address generator
particularly attractive for applications
such as image processing, in which matrices
must be plucked out of very large data arrays.
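
As a rough illustration of the matrix addressing form a × 2^n + b, the sketch below builds an effective address with a shift, a mask, and a merge. It assumes each row holds 2^n elements and that the result wraps to 32 bits; the function and parameter names are hypothetical and do not correspond to Am29332 instructions.

```python
MASK32 = 0xFFFFFFFF


def matrix_address(base: int, row: int, col: int, n: int) -> int:
    """Effective address base + (row * 2**n + col), wrapped to 32 bits.

    Assumes each matrix row holds 2**n elements, so col fits in n bits.
    """
    col_field = col & ((1 << n) - 1)   # mask the column index to n bits
    offset = (row << n) | col_field    # shift the row index and merge
    return (base + offset) & MASK32    # base + pointer form


if __name__ == "__main__":
    print(hex(matrix_address(0x1000, 3, 5, 4)))
```

Because the column field occupies the low n bits, the OR in the merge step gives the same result as an addition, so no carry chain is needed to form the offset.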

The arithmetic unit contains three arithmetic
facilities: an Am29325 for floating-point
operations, and the Am29332 and Am29323 for
integer and logical operations. These devices
accept data from a six-port register file made of
four Am29334s. The register file has three
purposes. It acts as a fast, temporary scratchpad
for data. It routes data among arithmetic devices
(the output of one arithmetic device can be
written to the register file, and be used as an
input operand by another such device during the
following clock cycle). And it provides access to
four data words every clock cycle, so that two or
more arithmetic devices can operate in parallel.

An example of this parallelism is integer
multiplication-accumulation: because the
Am29323 and the Am29332 receive operands
independently, an integer product and sum can be
calculated every clock cycle. The register file
can then pass products from the Am29323 to the
Am29332, for a throughput of one clock cycle
per multiplication-accumulation.
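
The overlap of multiplication and accumulation can be modeled with a small cycle-by-cycle simulation. This is a behavioral sketch only; the single `product_reg` variable stands in for a register-file location, and all names are hypothetical.

```python
def pipelined_mac(xs, ys):
    """Model a multiplier and an adder working in parallel.

    In each cycle the adder accumulates the product formed in the previous
    cycle while the multiplier forms the next product. Returns the sum of
    products and the number of cycles used.
    """
    acc = 0
    product_reg = None  # stands in for a register-file location
    cycles = 0
    # One extra cycle at the end drains the last product into the sum.
    for x, y in list(zip(xs, ys)) + [(None, None)]:
        if product_reg is not None:
            acc += product_reg                         # adder stage
        product_reg = x * y if x is not None else None  # multiplier stage
        cycles += 1
    return acc, cycles


if __name__ == "__main__":
    print(pipelined_mac([1, 2, 3], [4, 5, 6]))
```

For N operand pairs the model takes N + 1 cycles, which approaches one multiplication-accumulation per cycle as N grows.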

Operation of the processor might be best
understood by considering the execution of a
typical macro. For graphics applications, one
such macro is Translate, with which a set of
points in three-dimensional space is moved in a
given direction. The set of points is described by
a list of vectors (Xi, Yi, Zi), while the
translation is described by vector (XT, YT, ZT);
each vector is stored in three contiguous data
memory locations. Translation is performed by
adding the translation vector to each entry in
the vector list.

The translation process begins when the
microprogram controller encounters a Translate
instruction in program memory. The Translate
instruction is accompanied by three parameters:
the start address of the translation vector,
the start address of the vector list, and the
number of vectors in the list. The first two
parameters are passed to the address generator,
the third to the iteration counter.

The microprogram section of the controller
then assumes command, accessing the microcode
for the Translate instruction. The microcode
controls the address generator and arithmetic
unit, specifying the operations needed to
fetch each vector from the vector list, add the
translation vector, and return the modified
vector to the data memory. After all vectors in
the list have been processed (as indicated by the
iteration counter), control is returned to the
Am2910A program sequencer, which then
accesses the next macro from program memory.
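
The effect of the Translate macro on data memory can be sketched as follows. The flat list used for data memory and the parameter names are assumptions made for illustration; the three parameters match those described above (start address of the translation vector, start address of the vector list, and vector count).

```python
def translate(memory, t_addr, list_addr, count):
    """Add the translation vector to every vector in the list, in place.

    memory    -- flat list standing in for data memory
    t_addr    -- start address of the translation vector (3 words)
    list_addr -- start address of the vector list (3 words per vector)
    count     -- number of vectors, as loaded into the iteration counter
    """
    tx, ty, tz = memory[t_addr:t_addr + 3]
    for i in range(count):           # iteration counter
        p = list_addr + 3 * i        # address generator: base + pointer
        memory[p] += tx              # fetch, add, write back
        memory[p + 1] += ty
        memory[p + 2] += tz
    return memory


if __name__ == "__main__":
    mem = [1, 2, 3, 10, 20, 30, 40, 50, 60]
    print(translate(mem, 0, 3, 2))
```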
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Designed using concepts of functional partitioning, three-bus 
architecture, and fault detection, a family of 32-bit building 
blocks can satisfy the needs of both general-purpose 
computing and signal processing. 


by Timothy J. Flaherty 


As processing rates, machine densities, and system
reliability requirements increase, functional integra- 
tion at the device level becomes mandatory in high 
performance controllers and processors. When used 
as standalone devices or combined in a high speed 
system, functional building blocks can provide solu- 
tions to a wide range of design problems. They fit 
equally well into a general-purpose computer and 
a digital signal processor, despite the great functional 
differences between these two systems. 

As device densities have increased over the years, 
system word widths have grown, bringing greater 
precision and allowing a larger memory space to 
be addressed. The jumps from 4- to 8-bit and from 
8- to 16-bit systems occurred relatively quickly. 
The leap to 32-bit systems has already taken place, 
bringing with it a slowdown in the quest for wider 
system words. Partitioned into 32-bit building
blocks, Advanced Micro Devices’ Am29300 family
integrates functions that are difficult, if not
impossible, to implement with bit-slice devices.
These functions include barrel shifting, priority
encoding, and mask generation.

Whenever carry-lookahead logic can be contained
in the same device as the arithmetic logic it supports,
cycle time is improved. In fact, by reducing the
amount of intrafunction communication across chip
boundaries, cycle time no longer has to depend on
the speed of the interface between components.
Because of this, all intrafunction communications
across chip boundaries were eliminated in the
Am29300 family.

Pipelining can result in the faster execution of
certain highly repetitive operations, but system
latency increases. In some cases, this latency will
actually degrade throughput. In a recursive algorithm,
where a calculation depends on the immediately
preceding result, true throughput can be lost while
waiting for intermediate results to work their way
through the pipe. To maximize performance without
sacrificing architectural flexibility, intrafunction
pipelining was also eliminated in the Am29300 devices.

A three-bus, flow-through architecture complements
functional partitioning in this chip family.
The data path members share a common bus
configuration with two input operand buses and one
output bus. Independent of each other (neither
bidirectional nor shared), these buses provide
maximum accessibility.

Bidirectional or shared I/O buses limit the speed 
at which information can be transferred between 
different parts in the system. Achieving rapid turn- 
around of bidirectional TTL data buses is often an 
arduous task. And shared input buses require greater 
timing restrictions than do nonshared buses. These 
limitations have been eliminated in the Am29300 
data path devices by removal of shared or bidirec- 
tional buses. 

A high data transfer bandwidth is achieved with 
the flow-through architecture. This direct access 
allows the designer to tailor the system’s register file 
to the specific application rather than forcing use 
of a fixed, more general memory organization. The 
beauty of the three-bus architecture lies in its sim- 
plicity. This straightforward structure permits many 
possible component configurations optimized for 
different micro-architectures. 

Simple, internal I/O registers may introduce
unwanted pipeline delays. A flexible register structure
requires that any I/O registers can be made
transparent. The input and output registers on both the
Am29323 parallel multiprecision multiplier and the
Am29325 floating point processor can be made
transparent independently, providing a number of
different register configurations including flow-through.

[Figure caption] The three-bus, flow-through
architecture affords the greatest access to the device
cores. Byte-parity checking detects connection
failures between devices.

[Figure caption] Master/slave checking provides device
failure detection using redundant parallel devices.
This happens without incurring the delay of a "voting"
scheme often used in high reliability designs.


Fault detection 

The philosophy governing this chip family is maxi- 
mum functionality with minimum impact on system 
cycle time. The methods used for fault detection put 
this idea into practice. 

The 32-bit family addresses fault detection at the 
component level using a twofold scheme—byte par- 
ity and master/slave checking. To detect intercon- 
nection failures, byte parity is both generated and 
checked by the data path elements of the family. The 
byte parity circuitry checks for single bit failures 
across each byte of the two input operands. Even par- 
ity checking was chosen for this family of TTL- 
compatible parts instead of odd parity to provide the 
additional check for bus failure. Any parity faults 
detected cause assertion of the parity error 
(PARERR) flag. 
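
A minimal sketch of byte-wise even parity generation and checking is shown below. The ordering of the four parity bits (least significant byte first) and the function names are assumptions made for the example, not the device's pin assignment.

```python
def byte_parity_bits(word: int) -> list[int]:
    """Return four even-parity bits for a 32-bit word, low byte first.

    Each parity bit is chosen so that the byte plus its parity bit
    contains an even number of ones.
    """
    bits = []
    for i in range(4):
        byte = (word >> (8 * i)) & 0xFF
        bits.append(bin(byte).count("1") & 1)
    return bits


def parity_error(word: int, parity) -> bool:
    """True if any received parity bit disagrees with the recomputed one."""
    return byte_parity_bits(word) != list(parity)


if __name__ == "__main__":
    print(byte_parity_bits(0x01030007))
    print(parity_error(0xFFFFFFFF, [1, 1, 1, 1]))
```

One possible reading of the "additional check for bus failure" is that a bus on which every line, parity included, reads HIGH fails an even-parity check, as the second call shows. That interpretation is an assumption, not something the text states.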

Master/slave checking detects failures at the device 
level. When using this mode, two devices are oper- 
ated in parallel, each receiving the same data and 
instructions. The master device generates its result 
and transfers this information to the output bus. The 
slave device generates its own result from the same 
inputs; instead of delivering this data, however, the 
slave reads the output bus and compares the master’s 
results with its own. The hard error (HARDERR)
flag indicates any discrepancies between the two
outputs. Moreover, the assertion level of both the
parity and hard error flags indicates device failure
due to loss of power and error signal faults.

Both error checking schemes operate on a cycle-by-cycle
basis, so any detected fault triggers an interrupt
at the microinstruction level. Unlike other
redundant schemes, specialized software is not
required, and system performance is not affected
by the communication between redundant
functional units.
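
The master/slave comparison can be modeled in a few lines. This is a behavioral sketch under stated assumptions: `op` stands for whatever function both devices perform, and `slave_fault` is a hypothetical hook used only to inject an error for demonstration.

```python
def master_slave_cycle(op, x, y, slave_fault=None):
    """Run one cycle of two devices in lockstep.

    The master drives its result onto the output bus. The slave computes
    its own result from the same inputs and compares it with the bus.
    Returns (bus_value, harderr).
    """
    bus = op(x, y)                    # master result on the output bus
    slave_result = op(x, y)           # slave computes but does not drive
    if slave_fault is not None:
        slave_result = slave_fault(slave_result)  # injected fault (demo only)
    harderr = slave_result != bus     # any discrepancy raises the flag
    return bus, harderr


if __name__ == "__main__":
    print(master_slave_cycle(lambda a, b: a * b, 6, 7))
    print(master_slave_cycle(lambda a, b: a * b, 6, 7, lambda r: r ^ 1))
```

The comparison happens in the same cycle as the operation, which is why no voting delay appears in the model.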


Cycle time and control paths 

In high performance system design, the system’s
intended operations must be given careful
consideration, with particular attention paid to the
required cycle time. The cycle time depends on the
type of operation the system performs. The design
should be optimized for quick execution of the
instructions that make up the largest
percentage of the system’s operations.

Complex operations requiring long cycle times,
but used infrequently, should be performed over
multiple cycles. For example, a complicated
arithmetic procedure such as division should not
determine the cycle time of the system if the operation
is used only a small portion of the time. On the other
hand, if this operation is used frequently, the
system should be made to handle it efficiently.

Comparing the instruction mixes of a general-purpose
processor and a dedicated array processor
illustrates this point well. Multiplication operations
dominate the array processor’s instruction set, while
the general-purpose machine’s set would be less
multiplication intensive. The Am29323 parallel
multiplier would enhance an array processor by
providing a high speed, single cycle 32- x 32-bit
multiplication. The general-purpose machine might
not need a dedicated multiplier and could fare well
with the Am29332 ALU chip and its multiple cycle
multiplication capability.

When optimizing the system for speed, a designer 
should remember the control path. By causing a 
change in the normal flow of information in the con- 
trol path, conditional branching often becomes the 
system bottleneck. Conditional codes must be 
checked to determine the next address, but this check- 
ing can extend the cycle time. The speed of the con- 
trol path must remain on a par with the speed of the 
data path. The Am29331 microprogram sequencer 
architecture balances the timing between the control 
and data paths. 

By integrating the conditional code multiplexer, test 
logic, and polarity control logic in the same device, 
cycle time is reduced by eliminating intrafunction 
delays. The microprogram sequencer can perform 
four sets of 16-way branches upon the simultaneous 


The 32- x 32-bit parallel 
multiprecision multiplier 
provides dual input registers 
to support extended 
multiplications. An internal 
wrap-back path and shifter 
eliminate transferring data 
offchip for shifting. 
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A family gathering 


A member of the Am29300 family,
the Am29332 32-bit noncascadable
ALU chip, was designed for
systems requiring fast number
crunching, high data transfer
rates, and powerful bit-manipulation
capabilities. The internal
data path of the ALU chip
interconnects several functional
blocks. These blocks include a
mask generator, a funnel shifter,
an ALU, and a priority encoder.
The ALU chip allows operations
that once took multiple cycles to
be executed in a single cycle.

The ALU chip uses a 64- to 32-bit funnel shifter to
perform a full complement of N-bit shifts,
N-bit rotates, field extractions, and field logical
operations in a single cycle. This funnel shifter
works on either one or both input operands. This
shifting is extremely useful in operations such as
floating point mantissa normalization or
denormalization, and in applications where packing and
unpacking of data is a frequent task. Also, the ability
to extract a 32-bit contiguous field from two
operands provides a useful function in many
graphics-related operations. The output of the funnel
shifter is directed to the R input of the ALU, allowing
logical operations to then be performed on the shifted
word. The ALU section of the Am29332 has three
input ports. One input comes from the funnel
shifter, another from the mask generator, and a third
can be selected from various sources including
both input buses. This three-input ALU allows
merger of two instructions into a single cycle.
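
A 64- to 32-bit funnel shift can be modeled as selecting a 32-bit window from the concatenation of two operands. The sketch below is a behavioral illustration only; the operand order (hi:lo), the direction of the shift count, and the function names are assumptions rather than the Am29332 instruction encoding.

```python
MASK32 = 0xFFFFFFFF


def funnel_shift(hi: int, lo: int, shift: int) -> int:
    """Return the 32-bit field starting `shift` bits up from bit 0 of hi:lo."""
    if not 0 <= shift <= 32:
        raise ValueError("shift must be in the range 0..32")
    combined = ((hi & MASK32) << 32) | (lo & MASK32)  # 64-bit input
    return (combined >> shift) & MASK32               # 32-bit output window


def rotate_right(word: int, n: int) -> int:
    """An N-bit rotate is a funnel shift with the same word on both inputs."""
    return funnel_shift(word, word, n % 32)


if __name__ == "__main__":
    print(hex(funnel_shift(0x12345678, 0x9ABCDEF0, 16)))
    print(hex(rotate_right(0x80000001, 1)))
```

Supplying zero as one operand gives a plain logical shift, while two different operands give the field extraction used in the packing and graphics cases mentioned above.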

The Am29325, a single-precision floating point
processor, integrates a fully combinatorial 32-bit
floating point adder/subtractor, multiplier, and
data path in a single chip. This integration
minimizes processing overhead. The floating point
processor supports both IEEE P754 and Digital
Equipment Corp. floating point formats. All
instructions (addition, subtraction, multiplication,
floating point/integer conversions, and IEEE/DEC
conversions) are performed in a single clock
cycle. There are no internal pipeline delays to limit
true throughput.

The core of the floating point processor is a 
3-port arithmetic unit containing a mantissa proces- 
sor, an exponent processor, and additional logic re- 
quired to implement floating point operations. 

The Am29323 is a 32- x 32-bit parallel multiplier
with multiprecision capabilities designed
to perform a 32- x 32-bit multiplication in a single
cycle. The parallel multiplier also supports
multiple cycle, multiprecision multiplications. Using
a 67-bit onboard accumulator and internal wrap-back
paths, this device can perform a 64- x 64-bit
multiplication every four cycles. This part also
supports 96- x 96-bit and 128- x 128-bit
multiplications. These expanded multiplications offer
support to extended and double-precision format
floating point multiplications.
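
The way a 64- x 64-bit product can be built from four 32- x 32-bit partial products, with an accumulator and a 32-bit shift between steps, is sketched below for unsigned operands. This illustrates the arithmetic only and is an assumption about sequencing, not the device's actual instruction sequence; variable names are hypothetical.

```python
MASK32 = 0xFFFFFFFF


def mul64x64(x: int, y: int) -> int:
    """Unsigned 64 x 64-bit multiply using four 32 x 32-bit products."""
    x_lo, x_hi = x & MASK32, (x >> 32) & MASK32
    y_lo, y_hi = y & MASK32, (y >> 32) & MASK32

    acc = x_lo * y_lo            # product 1: pass into the accumulator
    word0 = acc & MASK32         # lowest 32 bits of the result
    acc >>= 32                   # shift, keeping the upper bits

    acc += x_lo * y_hi           # product 2: accumulate
    acc += x_hi * y_lo           # product 3: accumulate
    word1 = acc & MASK32         # next 32 bits of the result
    acc >>= 32                   # shift again

    acc += x_hi * y_hi           # product 4: accumulate the top product
    return (acc << 64) | (word1 << 32) | word0


if __name__ == "__main__":
    print(hex(mul64x64(0xFFFFFFFFFFFFFFFF, 0xFFFFFFFFFFFFFFFF)))
```

In this scheme the running sum never exceeds 66 bits, which is consistent with an accumulator a few bits wider than a single 64-bit product.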

To provide a flexible interface
for a variety of applications, the
parallel multiplier has dual 32-bit
registers on each input bus. Both
halves of a 64-bit input word can
be loaded, stored, and selected
as needed when extended
multiplications are performed. The
input registers can be made
transparent and the outputs can
be selected directly from the
array core to provide a high
speed multiplier accelerator in a
system designed with the other
members of the Am29300 family.

The task of reducing cycle time in a
microprogrammed system, a primary goal for the
Am29300 family, is assisted by the Am29331
microprogram sequencer. The critical path in the
control section of a system typically passes from the
status register through the test logic, test
multiplexer, sequencer, and microprogram memory. The
microprogram sequencer removes the "control
bottleneck" by integrating the test logic, multiplexer,
and sequencer.


Handling interrupts 

Interrupts and polling both handle asynchronous 
events. But in interrupts, unlike in polling, explicit 
tests in the microcode are not required. Quicker 
response times and less microcode are the reasons 
the microprogram sequencer uses interrupt han- 
dling. When an interrupt is received, the interrupt 
return address is pushed on an internal 33-level 
stack allowing nested interrupts. 

Interrupts are handled by the sequencer at the
end of a microcycle. Traps, on the other hand, must
be handled before the end of the microcycle. Be- 
cause they indicate an unexpected condition 
caused by the current microinstruction, traps cause 
the sequencer to halt the operation before the cur- 
rent instruction changes the state of the system. 
When a trap occurs, the current microinstruction 
must be aborted and re-executed after the trap han- 
dling routine has taken corrective measures. 

The Am29334, a high speed, 64- x 18-bit, dual-access
RAM, provides the Am29300 family with flexible,
configurable memory. The device’s dual read/write
ports allow simultaneous access for two operations
every cycle: two operand fetches, a read and a
write to two locations, or two write operations.

Because it can be expanded in both width and 
depth, the register file allows several memory con- 
figurations. Two of these devices may be hooked 
together in an expanded 6-port configuration, for 
example. This setup allows two processors to oper- 
ate on the same memory simultaneously. Four reads 
and two writes every cycle could provide high speed 
local memory, possibly configured as a cache. 





occurrence of four external test conditions. This abil- 
ity to handle multiway branching greatly reduces the 
branching delay penalty. 


Data routing 

A system should be able to route data punctually 
to the proper location—a task as important as reduc- 
ing cycle time. Cycles wasted while waiting for results 
to work their way out of the pipeline and into the 
arithmetic unit where they are needed degrade per- 
formance. Bus bandwidth is lost by the redundant 
transferring of intermediate results back and forth 
from memory. And cycles lost shuffling data reduce 
true throughput. 

Often data is fetched from one memory location, 
processed, and the result of the operation returned 
to the original memory location. The Am29334 
register file supports these read/modify/write oper- 
ations by allowing a single cycle read and write 
memory operation to the same location. The register 
file’s internal circuitry makes this operation possible 
without requiring external hardware to store the 
modified data temporarily. 

Maximum bus bandwidth requires operands to be
in the right place at the right time without
monopolizing the bus structure. Redundant data
transfers, such as returning intermediate sums from a
sum-of-products operation to memory, only congest the
bus structure and reduce bandwidth. The Am29325
floating point processor provides internal wrap-back
paths and handles such data routing onchip. These
internal wrap-back paths for sum-of-products
operations with intermediate results double the
bandwidth of a bus shared between multiple processors.

The Am29323 parallel multiplier also provides 
internal wrap-back paths and shifting circuitry for 
extended multiplications. These elements eliminate 
the delays resulting from data leaving the chip, being 
adjusted by an external shifter, and then returning 
to the device. The parallel multiplier has dual 32-bit 
input registers to support the cross-products needed 
for multiprecision multiplications. These registers 
also reduce bus congestion by eliminating the need 
for redundant memory fetches. 


System bus structures 

A general-purpose CPU falls short of today’s
number crunching requirements because it cannot
take advantage of highly structured array and
digital signal processing algorithms. The differences
between a general-purpose CPU design and an array
processor design allow optimization of the
configurations to serve specific needs.

[Figure caption] A 32-bit floating point/integer
processor can be designed using a parallel
configuration. In the system shown, the floating point
processor shares three 32-bit buses with the ALU chip
and the parallel multiplier. (The figure labels a
coefficient memory built from four Am27S281s.)

Different system designs often have different data 
bus requirements. Parallel configurations are useful 
and easily implemented. In a general-purpose ma- 
chine, for example, several processing units might 
share the same data buses. The Am29332 ALU chip 
can share three 32-bit buses with the parallel multi- 
plier and the floating point processor. Data could be 
passed from one processor to another through the 
shared register file. 

Although parallel configurations fit a general-purpose
design, specific processors may require
different bus structures. With a dedicated bus
structure and paired Am29334 RAMs configured
as a 6-port register file, for example, a high speed
system that is well-suited for matrix processing
can be designed. For array processing this
arrangement offers a distinct advantage over the shared
bus system.

To perform a multiplication/accumulation, the
Am29323 multiplies two 32-bit numbers; the product
is passed through the register file and added to the
previous product by the ALU chip. Performing the
multiplication and the addition in parallel results in
an effective throughput of one
multiplication/accumulation per clock cycle, twice
that of the system with shared buses.


Status generation 

The status flag generators on Am29300 data path
devices create flags that indicate when significant
events occur during the calculations. These flags,
generated as the operation is performed, reduce the
processing overhead. The fully decoded flags
minimize the amount of hardware needed for status
interpretation. Such special conditions as a zero result
or a byte carry may provide the user with important
information about the calculation. A zero result,
reported via the ZERO flag, is useful in comparison
operations, for example.

Many of the status flags report such exception con- 
ditions as underflow, overflow, and invalid. Each of 
these conditions would indicate that the result ob- 
tained is not correct. These flags are active whether 
or not the output bus is enabled. In this way, the sta- 
tus of such iterative operations as floating point multi- 
plication/accumulation can be monitored without 
enabling the output bus to check each intermediate 
calculation for exception conditions. This will reduce 
both hardware requirements and bus congestion. 

Many of the conditions reported by the status flags 
indicate a problem with the current operation. The 
INVALID flag on the floating point processor, for 
example, indicates an invalid operation has been at- 
tempted. This flag can be used to generate an inter- 
rupt. The microprogram sequencer handles this 
interrupt at the microprogram level. After accepting 
the interrupt, the sequencer allows an external inter- 
rupt handling address to gain access to the micropro- 
gram address bus. This address begins the interrupt 
handling routine. The microprogram sequencer saves 
the interrupt return address on an internal stack. 

Traps are unexpected situations caused by the cur- 
rent microinstruction, which must be handled before 
the end of the current microcycle. Conditions such 
as overflow can be trapped so corrective action can 
be taken. 

Suppose the current instruction requires a read
from memory locations A and B, a floating point
addition using this data, and a write of the result
back into location A. If this addition operation
results in an overflow and the result is written back
into location A, information may be lost. This may
happen if the OVERFLOW flag is used to generate
an interrupt. A trapping setup, however, offers a
different scenario.

[Figure caption, partially lost: "...port register file.
With this setup, four reads and two..."]

If the OVERFLOW flag is used to indicate a trap, 
the operation can be interrupted before the overflow 
result can be written over the data in memory loca- 
tion A. An overflow trap handling routine scales both 
operands. Upon re-execution of the addition opera- 
tion, the result does not overflow. The microprogram 
sequencer pushes the address of the current microin- 
struction onto its internal stack and allows the trap 
handling address to gain access to the microprogram 
address bus. After completion of the trap handling 
routine, the trapped instruction address is popped 
from the stack and re-executed. 
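
The difference a trap makes can be illustrated with a small model in which the write to location A is suppressed until the addition completes without overflow. The overflow threshold, the halving used to scale the operands, and all names are assumptions chosen for the example; the text says only that the handler scales both operands.

```python
LIMIT = 1000.0  # hypothetical overflow threshold, for illustration only


class Overflow(Exception):
    """Stands in for the OVERFLOW flag being used as a trap."""


def fp_add(a: float, b: float) -> float:
    result = a + b
    if abs(result) > LIMIT:
        raise Overflow
    return result


def add_with_trap(mem: dict, a: str, b: str) -> int:
    """Compute mem[a] = mem[a] + mem[b], re-executing after each trap.

    Returns the total scale factor applied by the trap handler.
    """
    scale = 1
    while True:
        try:
            result = fp_add(mem[a], mem[b])
        except Overflow:
            # Trap: the write-back is aborted, so mem[a] is still intact.
            mem[a] /= 2          # handler scales both operands
            mem[b] /= 2
            scale *= 2
            continue             # re-execute the trapped instruction
        mem[a] = result          # write back only when no trap occurred
        return scale


if __name__ == "__main__":
    memory = {"A": 800.0, "B": 600.0}
    print(add_with_trap(memory, "A", "B"), memory)
```

With an interrupt instead of a trap, the overflowed result would already have replaced the contents of A by the time the handler ran.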


Multiway branching 

The Am29331 address sequencer’s multiway 
branch instructions allow the selection of 16 consecu- 
tive addresses as a branch target. Generated in a single 
cycle, the address consists of the upper 12 bits from 
the D bus concatenated with 4 bits from the multi- 
way inputs. This type of branching allows the test- 
ing of up to four conditions in a single clock cycle. 

Four multiway sets of 4 bits each allow designers
to group test conditions according to type. The 2 least
significant bits of the D bus, D0 and D1, control
which 4-bit multiway set is selected.
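
The formation of a multiway branch address can be sketched as below. The sketch assumes a 16-bit D bus and that the four multiway sets are indexed by the value of D1:D0; the function name and the list representation of the multiway inputs are hypothetical.

```python
def multiway_target(d_bus: int, multiway_sets) -> int:
    """Form a 16-bit branch address from the D bus and the multiway inputs.

    d_bus         -- 16-bit value; the upper 12 bits give the block of 16
                     consecutive addresses, D1:D0 select the multiway set
    multiway_sets -- four 4-bit values, one per multiway input set
    """
    select = d_bus & 0b11                  # D0 and D1 choose the set
    tests = multiway_sets[select] & 0xF    # four test conditions
    return (d_bus & 0xFFF0) | tests        # upper 12 bits + 4 test bits


if __name__ == "__main__":
    print(hex(multiway_target(0x1232, [0x0, 0x5, 0xA, 0xF])))
```

Because the four test bits form the low end of the address directly, each combination of conditions lands on its own entry in a 16-entry block without any intermediate decoding.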

The multiway-branch feature provides designers 
with a hardware solution to the problem of per- 
forming certain high level software instructions in a 
single cycle. Many combinations of conditions could 
be arranged. For example, CARRY, ZERO, and 
NEGATIVE flags from the Am29332 ALU chip 
might be used to perform If (AND/OR)-Then 
(AND/OR)-Else operations. By using multiway 
branching, the AND/OR functions in the If-Then- 
Else statement do not incur the penalty of additional 
gates or additional delay. 
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Am29323 


32-Bit Parallel Multiplier 


ADVANCED INFORMATION 


DISTINCTIVE CHARACTERISTICS 


• 32-Bit Three-Bus Architecture
  - The device has two 32-bit input ports and one 32-bit
    output port with maximum multiply time of 80 ns
• Single Clock with Register Enables
  - The Am29323 is controlled by one clock with
    individual register enables
• Supports Multiprecision Multiplication
  - The device has dual 32-bit registers on each data
    input port to perform multiprecision multiplication
• Registers can be made transparent
  - Input and output registers can be made transparent
    independently to eliminate unwanted pipeline delay
• Supports Two's Complement, Unsigned or Mixed Numbers
• Data Integrity Through Master-Slave Mode and Parity
  Check/Generate
  - Parity check/generate catches inter-device
    connection errors and master/slave mode provides
    complete function check


GENERAL DESCRIPTION 


The Am29323 is a high-speed 32 x 32-Bit Parallel Multiplier
with 67-Bit Accumulator. The part is designed to maximize
system level performance by providing a 32-bit three
bus architecture and a single clock with register enables.


The Am29323 further enhances the system throughput by
providing individual register feedthrough controls, byte
parity checking on both input ports and generation on the
output port, and dual input registers on each data input bus
to support multiprecision multiplication. The Am29323 can
manage a wide variety of data types, including two's
complement, unsigned, or mixed mode input formats. A 64
x 64-bit multiplication can be performed in seven clock
cycles, including input and output. Additional features
provided are a format adjust control allowing for standard
output or left shifted output suitable for fractional two's
complement arithmetic, rounding, and master/slave operation.


The Am29323 is designed with the IMOX™ process, which
allows internal ECL circuits with TTL-compatible I/O. The
device is housed in a 168-lead pin-grid-array package.


SIMPLIFIED BLOCK DIAGRAM 
PIN DESCRIPTION


X31-X0        Multiplicand data input port.

Y31-Y0        Multiplier data input port.

P31-P0        Product output port.

TCX, TCY      Mode control inputs for each input data
              word; LOW for unsigned data and HIGH
              for two's complement format.

ACC1, ACC0    Accumulator control lines used to
              determine accumulator function; PASS,
              ACCUMULATE, SHIFT/ACCUMULATE.


Round control for rounding the most
significant product.


Clock; all registers.


Register enables for multiplicand data
input registers (XA and XB).


Register enables for multiplier data input
registers (YA and YB).


Register enable for accumulator product
register (P).


Register enable for instruction register (I).


Register enable for temporary register
(T).


Control line used to route the contents of
either the XA register (HIGH) or XB
register (LOW) into the multiplier array.


Control line used to route the contents of
either the YA register (HIGH) or YB
register (LOW) into the multiplier array.

FUNCTIONAL DESCRIPTION 
Architecture 


The Am29323 comprises a high speed 32 by 32-bit multiplier 
array, a 67-bit accumulator, and a 32-bit data path. 


Multiplier Array 


The multiplier is a 32 by 32-bit array which produces a 64-bit 
product. This product is then fed to the accumulator section. 


Accumulator 


The accumulator is 67 bits wide. It performs accumulation for 
sum of product operations and multiprecision multiplication 
operations. The accumulator can perform three operations: 
store product without accumulation, accumulate product, and 
shift accumulator value and accumulate with product. 


Data Path 


The 32-bit data path consists of X and Y input buses; the P 
output bus; data registers XA, XB, YA, YB, and the product 
accumulator; two multiplier input multiplexers; byte parity input 
checkers; byte parity output generators; and master/slave 
comparators. Input operands enter the device through the two
32-bit input buses, X0-X31 and Y0-Y31. These operands
may then be stored in one of the two registers for each bus 
(XA or XB for X, YA or YB for Y) or they may be fed directly 
through to the multiplier array. Input parity checking is per- 
formed as soon as the operands are put on the input buses. 
The signals used for output parity generation are taken from 
the input side of the output translator. 


FA            Format adjust; selects either a full 64-bit
              product (HIGH) or a left-shifted 63-bit
              product suitable for fractional two's
              complement arithmetic (LOW).

TSEL          Select control line used to route the most
              significant product register (HIGH) or the
              least significant product register (LOW)
              into the temporary register.

FTX, FTY,     Feedthrough control lines for X, Y, and I
FTI           registers.

FTP           Bypass control for output multiplexer.

PSEL1, PSEL0  Product control lines used to select
              desired output including disabling P
              output port.

PX3-PX0       Byte parity inputs on X input port.

PY3-PY0       Byte parity inputs on Y input port.

PP3-PP0       Byte parity outputs on P output port.

PARERR        Parity error flag indicates a parity error on
              the input buses.

OE            Output enable control line used to disable
              the P output port.

SLAVE         Master/Slave control line used to
              determine mode of operation.

HARDERR       Hard error flag used when two Am29323s
              are configured as master and slave to
              indicate hardware errors.


Operational Modes 


The Am29323 can perform signed, unsigned, or mixed mode 
multiplication. These different numerical representations are 
controlled by TCX and TCY. A HIGH input on one of these 
lines indicates to the device that the respective input should 
be treated as a two's complement number; a LOW, an 
unsigned number. The output format is unsigned when both 
inputs are unsigned. The output format is two's complement 
when either or both inputs are two's complement. 
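The effect of TCX and TCY can be illustrated with a short behavioral sketch. This is not a model of the device's internal circuitry; the function names `to_value` and `multiply` are hypothetical, and the sketch only shows how the same 32-bit pattern yields different products depending on the selected mode.

```python
MASK32 = 0xFFFFFFFF
MASK64 = 0xFFFFFFFFFFFFFFFF


def to_value(word, twos_complement):
    """Interpret a 32-bit word as two's complement (True) or unsigned (False)."""
    word &= MASK32
    if twos_complement and word & 0x80000000:
        return word - (1 << 32)
    return word


def multiply(x, y, tcx, tcy):
    """Return the 64-bit product pattern for the modes given by tcx and tcy.

    tcx / tcy play the role of the TCX / TCY inputs: True stands for HIGH
    (two's complement), False for LOW (unsigned).
    """
    product = to_value(x, tcx) * to_value(y, tcy)
    return product & MASK64  # negative products wrap to two's complement


# The pattern 0xFFFFFFFF is -1 with TCX HIGH and 4294967295 with TCX LOW.
mixed = multiply(0xFFFFFFFF, 2, tcx=True, tcy=False)
unsigned = multiply(0xFFFFFFFF, 2, tcx=False, tcy=False)
print(hex(mixed), hex(unsigned))
```

In the mixed case the product is negative, which is why the output format is two's complement whenever either input is two's complement.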


Command Description and Formats 


The accumulator is controlled by ACC0 and ACC1. These
lines are used to select any of the three operations that the 
accumulator can perform. This instruction set is described in 
Table 1. 


The temporary output register is controlled by TSEL and FA. 
These lines are used to select any of the four different sets of 
data that can be stored in the temporary register. This 
instruction set is described in Table 2. 


The output multiplexer is controlled by PSEL0, PSEL1, and
FA. These lines are used to select any of the five different sets
of data that can be output through the P port. PSEL0 and
PSEL1 can also be used to disable the outputs. (This
instruction is independent of OE.) This instruction set is 
described in Table 3. 


Format Adjust (FA) is used to select either a full 64-bit product 
or a left-shifted 63-bit product suitable for fractional two's 
complement arithmetic. This shifting increases the precision of 
the upper half of the product word by eliminating the redundant
sign bit. The Output Data Formats section (page 5) shows the
effect of FA.
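The redundant sign bit can be seen in a small numerical sketch. It assumes fractional two's complement operands in which bit 31 has weight -2^0 (a Q31-style reading of the input format), and the helper names `signed32` and `most_significant_product` are hypothetical.

```python
MASK64 = (1 << 64) - 1


def signed32(word):
    """Interpret a 32-bit pattern as a two's complement integer."""
    word &= 0xFFFFFFFF
    return word - (1 << 32) if word & 0x80000000 else word


def most_significant_product(x, y, fa):
    """Upper 32 bits of the product of two fractional two's complement words.

    fa=True stands for FA HIGH (full 64-bit product); fa=False stands for
    FA LOW (product shifted left one place to drop the redundant sign bit).
    """
    product = (signed32(x) * signed32(y)) & MASK64
    if not fa:
        product = (product << 1) & MASK64
    return product >> 32


HALF = 0x40000000  # 0.5 when bit 30 has weight 2^-1

# 0.5 * 0.5 = 0.25. Shifted, 0.25 is bit 29 (weight 2^-2 in the input scale).
print(hex(most_significant_product(HALF, HALF, fa=False)))  # 0x20000000
# Unshifted, the same value sits one place lower (bit 28).
print(hex(most_significant_product(HALF, HALF, fa=True)))   # 0x10000000
# -1.0 * -1.0 overflows the shifted format and reads back as -1.0.
print(hex(most_significant_product(0x80000000, 0x80000000, fa=False)))  # 0x80000000
```

With FA LOW the upper product word has the same scaling as the operands, so it can be fed straight back in as an operand for a following multiplication.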


User Visible Register Descriptions 


The Am29323 contains seven different register sets, each with 
its own clock enable. Two 32-bit registers are attached to each 
of the input data buses. These registers are differentiated by 
the suffix A or B. For example, the X bus has registers XA and 
XB. The 67-bit accumulator register can be used as a regular 
product register when the part is used as a multiplier only or as 
the register part of the accumulator section. The 32-bit 
temporary output register is included to aid in the pipelining of 
multiprecision multiplication operations. An instruction register
is also provided. 


All of these registers can be made transparent with the 
exception of the accumulator register and the temporary 
register. The product from the multiplier can be fed directly to 
the output by using the FTP control line. 


TABLE 1. ACCUMULATOR OPERATION 
INSTRUCTIONS 


ACC1   ACC0   Accumulator Operation


TABLE 2. INPUT SELECT INSTRUCTIONS FOR 
TEMPORARY (T) REGISTER 


TSEL   FA   Temp Reg Input


TABLE 3. OUTPUT SELECT INSTRUCTIONS FOR 
PRODUCT (P) PORT 


PSEL1   PSEL0   FA   P Port Output
0       0       X    Temp register


Am29323 X AND Y INPUT DATA FORMATS


Fractional Two's Complement (TCX, TCY = 1)
  Bits 31 to 0 carry weights $-2^{0}, 2^{-1}, 2^{-2}, \ldots, 2^{-30}, 2^{-31}$.

Integer Two's Complement (TCX, TCY = 1)

Unsigned Fractional (TCX, TCY = 0)
  Bits 31 to 0 carry weights $2^{-1}, 2^{-2}, \ldots, 2^{-31}, 2^{-32}$.

Unsigned Integer (TCX, TCY = 0)
  Bits 31 to 0 carry weights $2^{31}, 2^{30}, \ldots, 2^{1}, 2^{0}$.


Am29323 P-PORT OUTPUT DATA FORMATS


Fractional Two's Complement (Shifted)*
  FA=0, PSEL1=1, PSEL0=0: bits 31 to 0 carry weights $-2^{0}, 2^{-1}, \ldots, 2^{-30}, 2^{-31}$.
  FA=0, PSEL1=0, PSEL0=1: bits 31 to 0 carry weights $2^{-32}, 2^{-33}, \ldots, 2^{-62}, 2^{-63}$**.

Fractional Two's Complement
  FA=1, PSEL1=1, PSEL0=0: bits 31 to 0 carry weights $-2^{1}, 2^{0}, 2^{-1}, \ldots, 2^{-29}, 2^{-30}$.
  FA=1, PSEL1=0, PSEL0=1: bits 31 to 0 carry weights $2^{-31}, 2^{-32}, \ldots, 2^{-61}, 2^{-62}$.

Integer Two's Complement
  FA=1, PSEL1=1, PSEL0=0

Unsigned Fractional
  FA=1, PSEL1=1, PSEL0=0
  FA=1, PSEL1=0, PSEL0=1: bits 31 to 0 carry weights $2^{-33}, 2^{-34}, \ldots, 2^{-63}, 2^{-64}$.

Unsigned Integer
  FA=1, PSEL1=1, PSEL0=0
  FA=1, PSEL1=0, PSEL0=1: bits 3 to 0 carry weights $2^{3}, 2^{2}, 2^{1}, 2^{0}$.
*In this format, an overflow occurs in the attempted multiplication of the two's complement number -1.000 with itself, yielding a
product of +1.000, which cannot be represented in this format.
**This bit position ($2^{-63}$) equals zero in this format.
64 x 64 Multiplication


To perform a 64 x 64-bit multiplication using the Am29323,
each 64-bit input must be split into two 32-bit inputs: a most
significant half and a least significant half (XW1 and XW0 or
YW1 and YW0, respectively). These 32-bit inputs are then
used to perform the four multiplications needed to obtain the
128-bit product. This product is represented in four 32-bit
words, PW3 - PW0, with PW0 the least significant word. The
product is output 32 bits at a time through the product (P) port.
The following equation shows the required multiplications:


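$$
\begin{aligned}
X \cdot Y &= (XW1 \cdot YW1) \cdot 2^{64} + (XW0 \cdot YW1) \cdot 2^{32} + (XW1 \cdot YW0) \cdot 2^{32} + (XW0 \cdot YW0) \cdot 2^{0} \\
          &= (PW3 \cdot 2^{96}) + (PW2 \cdot 2^{64}) + (PW1 \cdot 2^{32}) + (PW0 \cdot 2^{0})
\end{aligned}
$$
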
The Am29323 uses an internal accumulator to sum these
intermediate products. The previous equation, in a slightly
different form, is shown with the necessary instructions below:


                   XW1         XW0
           *       YW1         YW0
           -----------------------
                       XW0 * YW0        <- Multiply only
               XW1 * YW0                <- Mult & Shift/Acc
               XW0 * YW1                <- Mult & Accumulate
       XW1 * YW1                        <- Mult & Shift/Acc
           -----------------------
P =>       PW3     PW2     PW1     PW0

Table 4 details the movement of the input operands through 
the Am29323. Table 5 defines the microcode required to 
perform a signed 64 x 64-bit multiplication. For an unsigned 
multiplication, TCX and TCY are LOW for all cycles. The 
operations and data movement are scheduled to produce a 
single product in seven clock cycles or a new pipelined 
product every four clock cycles. 
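The instruction sequence above can be sketched in software. The sketch below handles unsigned operands only and assumes that the Shift/Accumulate operation shifts the accumulator right by 32 bits before adding the new product; the function name `multiply_64x64` is hypothetical, and the per-cycle register and control settings of Tables 4 and 5 are not modeled.

```python
MASK32 = 0xFFFFFFFF


def multiply_64x64(x, y):
    """Unsigned 64 x 64-bit multiply built from four 32 x 32-bit products.

    Returns [PW0, PW1, PW2, PW3], least significant word first.
    """
    xw0, xw1 = x & MASK32, (x >> 32) & MASK32
    yw0, yw1 = y & MASK32, (y >> 32) & MASK32
    words = []

    acc = xw0 * yw0                      # Multiply only (PASS)
    words.append(acc & MASK32)           # PW0

    acc = (acc >> 32) + xw1 * yw0        # Mult & Shift/Acc
    acc = acc + xw0 * yw1                # Mult & Accumulate
    assert acc < (1 << 67)               # the running sum fits in 67 bits
    words.append(acc & MASK32)           # PW1

    acc = (acc >> 32) + xw1 * yw1        # Mult & Shift/Acc
    words.append(acc & MASK32)           # PW2
    words.append((acc >> 32) & MASK32)   # PW3
    return words


x, y = 0xFFFFFFFFFFFFFFFF, 0x123456789ABCDEF0
pw = multiply_64x64(x, y)
product = sum(word << (32 * i) for i, word in enumerate(pw))
print(product == x * y)  # True
```

The middle step adds two 64-bit products to a shifted remainder, so the running sum can exceed 64 bits; this is one reason an accumulator wider than the 64-bit product is useful.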


TABLE 4. BUS AND REGISTER CONTENTS FOR A 64x 64-BIT SIGNED MULTIPLICATION WITH ONE 
COMPLETE EXTENDED MULTIPLICATION SHOWN IN THE UNSHADED CYCLES 


Note: MPY OP = Operation of multiplier array (X*Y) 
ACC OP = Operation of internal accumulator 
PASS = Pass through multiplier product 
ACC = Add previous result to current product 
S/A = Shift previous result then add to current product 


TABLE 5. INSTRUCTION MICROCODE FOR 64x 64-BIT SIGNED MULTIPLICATION WITH ONE COMPLETE 
EXTENDED MULTIPLICATION SHOWN IN THE UNSHADED CYCLES 
Am29325


32-Bit Floating Point Processor 
PRELIMINARY 


DISTINCTIVE CHARACTERISTICS 


• Single VLSI device performs high-speed floating-point
  arithmetic
  - Floating-point addition, subtraction and multiplication
    in a single clock cycle
  - Internal architecture supports sum-of-products,
    Newton-Raphson division
• 32-bit, 3-bus flow-through architecture
  - Programmable I/O allows interface to 32- and 16-bit
    systems
• IEEE and DEC formats
  - Performs conversions between formats
  - Performs integer <-> floating point conversions
• Six flags indicate operation status
• Register enables eliminate clock skew
• Input and output registers can be made transparent
  independently


GENERAL DESCRIPTION 


The Am29325 is a high-speed floating-point processor unit.
It performs 32-bit single-precision floating-point addition,
subtraction, and multiplication operations in a single LSI
integrated circuit, using the format specified by the proposed
IEEE floating-point standard P754. The DEC single-precision
floating-point format is also supported. Operations
for conversion between 32-bit integer format and floating-point
format are available, as are operations for converting
between the IEEE and DEC floating-point formats. Any operation
can be performed in a single clock cycle. Six flags
(invalid operation, inexact result, zero, not-a-number, overflow,
and underflow) monitor the status of operations.

The Am29325 has a 3-bus, 32-bit architecture, with two
input buses and one output bus. This configuration provides
high I/O bandwidth, allows access to all buses and affords a
high degree of flexibility when connecting this device in a
system. All buses are registered, with each register having a
clock enable. Input and output registers may be made transparent
independently. Two other I/O configurations, a 32-bit,
2-bus architecture and a 16-bit, 3-bus architecture, are
user-selectable, easing interface with a wide variety of systems.
Thirty-two-bit internal feedforward data paths support
accumulation operations, including sum-of-products and
Newton-Raphson division.

Fabricated with the high-speed IMOX™ bipolar process, the
Am29325 is powered by a single 5-volt supply. The device is
housed in a 144-pin pin-grid-array package.
RELATED PRODUCTS


• Am29323 - 32 x 32 Parallel Multiplier
• Am29332 - 32-Bit ALU
• Am29331 - 16-Bit Sequencer
• Am29334 - 64 x 18 Four-Port Dual-Access Register File


IMOX is a trademark of Advanced Micro Devices, Inc.          Order # 05621B


BLOCK DIAGRAM
Am29325

[Block diagram. Labels recovered from this part of the figure: input bus R0-R31, register R, port R, CLK, and the select and enable lines.]

DEFINITION OF TERMS 


AFFINE MODE 


One of two modes affecting the handling of operations on 
infinities — see the Operations with Infinities section under 
Operation in IEEE Mode below. 


BIASED EXPONENT 


The true exponent of a floating-point number, plus a constant. 
For IEEE floating-point numbers, the constant is 127; for DEC 
floating-point numbers, the constant is 128. See also True 
Exponent. 


BUS 
Data input or output channel for the floating-point processor. 


DEC RESERVED OPERAND 


A DEC floating-point number that is interpreted as a symbol and 
has no numeric value. A DEC reserved operand has a sign of 1 
and a biased exponent of 0. 


DESTINATION FORMAT 


The format of the final result produced by the floating-point ALU. 
The destination format can be IEEE floating-point, DEC floating- 
point or integer. 


[Block diagram labels recovered from this part of the figure: floating-point ALU, port F, register F, register S, port S, status flag generator, status flag register, and the flag outputs INEXACT, INVALID, NAN, OVERFLOW, UNDERFLOW, and ZERO.]

FINAL RESULT
The result produced by the floating-point ALU. 


FRACTION 
The twenty-three least-significant bits of the mantissa. 


INFINITELY PRECISE RESULT 


The result that would be obtained from an operation if both 
exponent range and precision were unbounded. 


INPUT OPERANDS 


The value or values on which an operation is performed. For 
example, the addition 2 + 3 = 5 has input operands 2 and 3. 


MANTISSA 


The portion of a floating-point number containing the number's
significant bits. For the floating-point number $1.101 \times 2^{-3}$, the
mantissa is 1.101.


NAN (Not-a-Number) 


An IEEE floating-point number that is interpreted as a symbol,
and has no numeric value. A NAN has a biased exponent of
$255_{10}$ and a non-zero fraction.


PORT 


Data input or output channel for the floating-point ALU. 


PROJECTIVE MODE 


One of two modes affecting the handling of operations on 
infinities — see the Operations with Infinities section under 
Operation in IEEE Mode below. 


PIN DESCRIPTION


R0-R31    R operand bus, input. R0 is the least-significant
          bit.

S0-S31    S operand bus, input. S0 is the least-significant
          bit.

F0-F31    F operand bus, output. F0 is the least-significant
          bit.

CLK       Clock input for the internal registers.

ENR       Register R clock enable, input. When ENR is
          LOW, register R is clocked on the LOW-to-HIGH
          transition of CLK. When ENR is HIGH,
          register R retains the previous contents.

ENS       Register S clock enable, input. When ENS is
          LOW, register S is clocked on the LOW-to-HIGH
          transition of CLK. When ENS is HIGH,
          register S retains the previous contents.

ENF       Register F clock enable, input. When ENF is
          LOW, register F is clocked on the LOW-to-HIGH
          transition of CLK. When ENF is HIGH,
          register F retains the previous contents.

FT0       Input register feedthrough control, input.
          When FT0 is HIGH, registers R and S are
          transparent.

FT1       Output register feedthrough control, input.
          When FT1 is HIGH, register F and the status
          flag register are transparent.

I0-I2     Operation select lines, inputs. Used to select
          the operation to be performed by the ALU. See
          the ALU Operation Select Table for a list of
          operations and the corresponding codes.

I3        ALU S port input select, input. A LOW on I3
          selects register S as the input to the ALU S
          port. A HIGH on I3 selects register F as the
          input to the ALU S port.


ROUNDED RESULT 


The result produced by rounding the infinitely precise result to fit 
the destination format. 


TRUE EXPONENT (or Exponent) 


Number representing the power of two by which a floating-point
number's mantissa is to be multiplied. For the floating-point
number $1.101 \times 2^{-3}$, the true exponent is -3.


I4        Register R input select, input. A LOW on I4
          selects R0-R31 as the input to register R. A
          HIGH selects the ALU F port as the input to
          register R.

IEEE/DEC  IEEE/DEC mode select, input. When IEEE/DEC
          is HIGH, IEEE mode is selected. When
          IEEE/DEC is LOW, DEC mode is selected.

INEXACT   Inexact result flag, output. A HIGH indicates
          that the final result of the last operation was not
          infinitely precise, due to rounding.

INVALID   Invalid operation flag, output. A HIGH indicates
          that the last operation performed was
          invalid, e.g., infinity times 0.

NAN       Not-a-number flag, output. A HIGH indicates
          that the final result produced by the last operation
          is not to be interpreted as a number. The
          output in such cases is either an IEEE Not-a-Number
          (NAN) or a DEC reserved operand.

OE        Output enable, input. When OE is LOW, the
          contents of register F are placed on F0-F31.
          When OE is HIGH, F0-F31 assume a high-impedance
          state.

ONEBUS    Input bus configuration control, input. A LOW
          on ONEBUS configures the input bus circuitry
          for two-input bus operation. A HIGH on
          ONEBUS configures the input bus circuitry for
          single-input bus operation.

OVERFLOW  Overflow flag, output. A HIGH indicates that
          the last operation produced a final result that
          overflowed the floating-point format.

PROJ/AFF  Projective/affine mode select, input. Choice of
          projective or affine mode determines the way
          in which infinities are handled in IEEE mode. A
          LOW on PROJ/AFF selects affine mode; a
          HIGH selects projective mode.





RND0, RND1  Rounding mode selects, inputs. RND0 and
            RND1 select one of four rounding modes. See
            the Rounding Mode Select Table for a list of
            rounding modes and the corresponding control
            codes.

S16/32      Sixteen- or thirty-two-bit I/O mode select,
            input. A LOW on S16/32 selects the thirty-two-bit
            I/O mode; a HIGH selects the sixteen-bit I/O
            mode. In thirty-two-bit mode, input and output
            buses are 32 bits wide. In sixteen-bit
            mode, input and output buses are sixteen bits
            wide, with the least and most significant portions
            of the thirty-two-bit input and output
            words being placed on the buses during the
            HIGH and LOW portions of CLK, respectively.

UNDERFLOW   Underflow flag, output. A HIGH indicates that
            the last operation produced a rounded result
            that underflowed the floating-point format.

ZERO        Zero flag, output. A HIGH indicates that the last
            operation produced a final result of zero.


ARCHITECTURE


The Am29325 comprises a high-speed, floating-point ALU, a
status flag generator, and a 32-bit data path.


Floating-Point ALU


The floating-point ALU performs 32-bit floating-point operations.
It also performs floating-point-to-integer conversions, integer-
to-floating-point conversions, and conversions between the
IEEE and DEC floating-point formats. The ALU has two 32-bit
input ports, R and S, and a 32-bit output port, F.


Conceptually, the process performed by the ALU can be divided
into three stages (see Figure 1). The operation stage performs
the arithmetic operation selected by the user; the output of this
stage is referred to as the infinitely precise result of the operation.
The rounding stage rounds the infinitely precise result to fit in
the destination format; the output of this stage is called the
rounded result. The last stage checks for exceptional conditions.
If no exceptional condition is found, the rounded result is passed
through this stage. If some exceptional condition is found, e.g.,
overflow, underflow, or an invalid operation, this stage may
replace the rounded result with another output, such as +∞, -∞,
a NAN, or a DEC reserved operand. The output of this last stage
appears on port F, and is called the final result.


Figure 1. Conceptual Model of the Process Performed by
the Floating-Point ALU


The ALU performs one of eight operations; the operation to be
performed is selected by placing the appropriate control code on
lines I0-I2. The ALU Operation Select Table gives the control
codes corresponding to each of the eight operations.


The floating-point addition operation (R PLUS S) adds the
floating-point numbers on ports R and S, and places the
floating-point result on port F. In IEEE mode (IEEE/DEC = HIGH)
the addition is performed in IEEE floating-point format; in DEC
mode (IEEE/DEC = LOW) the addition is performed in DEC
format.


The floating-point subtraction operation (R MINUS S) subtracts
the floating-point number on port S from the floating-point
number on port R and places the floating-point result on port F. In
IEEE mode (IEEE/DEC = HIGH) the subtraction is performed in
IEEE floating-point format; in DEC mode (IEEE/DEC = LOW) the
subtraction is performed in DEC format.


The floating-point multiplication operation (R TIMES S) multiplies
the floating-point numbers on ports R and S, and places the
floating-point result on port F. In IEEE mode (IEEE/DEC = HIGH)
the multiplication is performed in IEEE floating-point format; in
DEC mode (IEEE/DEC = LOW) the multiplication is performed in
DEC format.


The floating-point constant subtraction (2 MINUS S) operation 
subtracts the floating-point value on port S from 2, and places the 
result on port F. The operand on port R is not used in this 
operation; its value will not affect the operation in any way. In
IEEE mode (IEEE/DEC = HIGH) the operation is performed in 
IEEE floating-point format; in DEC mode (IEEE/DEC = LOW) the 
operation is performed in DEC format. This operation is used to 
support Newton-Raphson floating-point division; a description of 
its use appears in Appendix C. 
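As a rough illustration of how 2 MINUS S supports division, the sketch below uses the standard Newton-Raphson reciprocal iteration x' = x * (2 - b * x), which needs only multiplications and a subtraction from 2. It is not taken from Appendix C. The function names, the seed value, and the iteration count are assumptions made for the example, and single-precision rounding is imitated with Python's `struct` module (round to nearest).

```python
import struct


def f32(value):
    """Round a Python float to IEEE single precision (round to nearest)."""
    return struct.unpack("<f", struct.pack("<f", value))[0]


def reciprocal(b, seed, iterations=4):
    """Approximate 1/b using only multiply and (2 - s) steps."""
    x = f32(seed)
    for _ in range(iterations):
        t = f32(b * x)      # R TIMES S
        t = f32(2.0 - t)    # 2 MINUS S
        x = f32(x * t)      # R TIMES S
    return x


def divide(a, b, seed):
    """Approximate a / b as a * (1 / b)."""
    return f32(a * reciprocal(b, seed))


# The seed is a coarse guess at 1/3; in a real system it might come from a
# small lookup table (an assumption, not something this section specifies).
print(divide(1.0, 3.0, seed=0.3))
```

Each iteration roughly squares the relative error of the estimate, so a seed good to a few bits reaches single-precision accuracy within a handful of passes.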


The integer-to-floating-point conversion (INT-TO-FP) operation 
takes a 32-bit, two’s complement integer on port R and places the 
equivalent floating-point value on port F. The operand on port S is
not used in this operation; its value will not affect the operation in
any way. In IEEE mode (IEEE/DEC = HIGH) the result is delivered
in IEEE format; in DEC mode (IEEE/DEC = LOW)
the result is delivered in DEC format.
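A 32-bit integer does not always fit in the 24-bit mantissa of a single-precision number, so INT-TO-FP can produce an inexact result. The sketch below illustrates this for IEEE format only; the function name `int_to_fp` is hypothetical, and the conversion uses round to nearest via Python's `struct` module rather than the device's selectable rounding modes.

```python
import struct


def int_to_fp(word):
    """Convert a 32-bit two's complement pattern to IEEE single precision.

    Returns (result, inexact), where inexact is True when rounding
    changed the value.
    """
    word &= 0xFFFFFFFF
    value = word - (1 << 32) if word & 0x80000000 else word
    result = struct.unpack("<f", struct.pack("<f", float(value)))[0]
    return result, result != value


print(int_to_fp(0x00000005))  # (5.0, False)
print(int_to_fp(0xFFFFFFFF))  # (-1.0, False)
print(int_to_fp(0x01000001))  # (16777216.0, True): 2^24 + 1 needs 25 bits
```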
ALU OPERATION SELECT TABLE


Operation                                          Output Equation

Floating-point addition (R PLUS S)                 F = R + S
Floating-point subtraction (R MINUS S)             F = R - S
Floating-point multiplication (R TIMES S)          F = R * S
Floating-point constant subtraction (2 MINUS S)    F = 2 - S
Integer-to-floating-point conversion (INT-TO-FP)   F (floating-point) = R (integer)
Floating-point-to-integer conversion (FP-TO-INT)   F (integer) = R (floating-point)
IEEE-TO-DEC format conversion (IEEE-TO-DEC)        F (DEC format) = R (IEEE format)
DEC-TO-IEEE format conversion (DEC-TO-IEEE)        F (IEEE format) = R (DEC format)


The floating-point-to-integer conversion (FP-TO-INT) operation
takes a floating-point number on port R and places the equivalent 
32-bit, two's complement integer value on port F. The operand on 
port S is not used in this operation; its value will not affect the 
operation in any way. In IEEE mode (IEEE/DEC = HIGH) the 
operand on port R is interpreted using the IEEE floating-point 
format; in DEC mode (IEEE/DEC = LOW) it is interpreted using 
the DEC floating-point format. 


The IEEE-to-DEC conversion operation (IEEE-TO-DEC) takes 
an IEEE-format floating-point number on port R and places the 
equivalent DEC-format floating-point number on port F. The 
operand on port S is not used in this operation; its value will not 
affect the operation in any way. The operation can be performed
in either IEEE mode (IEEE/DEC = HIGH) or DEC mode (IEEE/ 
DEC = LOW). 


The DEC-to-IEEE conversion operation (DEC-TO-IEEE) takes 
a DEC-format floating-point number on port R and places the 
equivalent IEEE-format floating-point number on port F. The 
operand on port S is not used in this operation; its value will not 
affect the operation in any way. The operation can be performed 
in either IEEE mode (IEEE/DEC = HIGH) or DEC mode (IEEE/ 
DEC = LOW). 


Status Flag Generator 


The status flag generator controls the state of six flags that report 
the status of floating-point ALU operations. The flags indicate 
when an operation is invalid (e.g., infinity times zero) or when an 
operation has produced an overflow, an underflow, a non- 
numerical result (e.g., a NAN or DEC reserved operand), an 
inexact result, or a result of zero. The flags represent the status of 
the most-recently-performed operation. Flag status is stored in 
the flag status register on the LOW-to-HIGH transition of CLK. 
When the output register feedthrough control FT1 is HIGH, the
flag status register is made transparent. 


Data Path 


The 32-bit data path consists of the R and S input buses, the F 
output bus, data registers R, S, and F, the register R input multi- 
plexer, and the ALU port S input multiplexer. 


Input operands enter the floating-point processor through the
32-bit R and S input buses, R0-R31 and S0-S31. Results
of operations appear on the 32-bit F bus, F0-F31. The F
bus assumes a high-impedance state when output enable
OE is HIGH.


The R and S registers store input operands; the F register stores
the final result of the floating-point ALU operation. Each register
has an independent clock enable (ENR, ENS and ENF). When a
register's clock enable is LOW, the register stores the data on its
input at the LOW-to-HIGH transition of CLK; when the clock
enable is HIGH, the register retains its current data. All data
registers are fully edge-triggered; both the input data and the
register enable need only meet modest setup and hold time
requirements. Registers R and S can be made transparent by
setting FT0, the input register feedthrough control, HIGH.
Register F can be made transparent by setting FT1, the output register
feedthrough control, HIGH.
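The register behavior described above (active-LOW clock enable, optional feedthrough) can be summarized in a small behavioral sketch. The class and method names are hypothetical, and timing requirements such as setup and hold are not modeled.

```python
class Register:
    """Edge-triggered register with an active-LOW enable and a feedthrough."""

    def __init__(self):
        self.stored = 0

    def rising_edge(self, data, enable_n):
        """Model a LOW-to-HIGH transition of CLK.

        enable_n is the clock enable level: 0 (LOW) loads the input,
        1 (HIGH) keeps the current contents.
        """
        if enable_n == 0:
            self.stored = data

    def output(self, data, feedthrough):
        """Value seen downstream: the input itself when the register is
        transparent (feedthrough HIGH), otherwise the stored contents."""
        return data if feedthrough else self.stored


reg_r = Register()
reg_r.rising_edge(0x3F800000, enable_n=1)   # enable HIGH: nothing is stored
reg_r.rising_edge(0x40000000, enable_n=0)   # enable LOW: input is stored
print(hex(reg_r.output(0x12345678, feedthrough=0)))  # 0x40000000
print(hex(reg_r.output(0x12345678, feedthrough=1)))  # 0x12345678
```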


The register R input multiplexer selects either the R input bus or
the floating-point ALU's F port as the input to register R. Selection
is controlled by I4: a LOW selects the R input bus; a HIGH
selects the ALU F port. The ALU port S input multiplexer selects
either register S or register F as the input to the floating-point
ALU's S port. Selection is controlled by I3: a LOW selects
register S; a HIGH selects register F.


Data selected by I3 and I4 is described in the Mux Select Tables.
When registers R and S are transparent (FT0 = HIGH) multiplexer
select I4 must be kept LOW, so that the register R input
multiplexer selects R0-R31. When register F is transparent (FT1
= HIGH) multiplexer select I3 must be kept LOW, so that the ALU
port S input multiplexer selects register S.


MUX SELECT TABLES


I3    Data selected for floating-point ALU S port
0     Register S
1     Register F


I4    Data selected for register R input
0     R bus
1     Floating-point ALU port F


I/O MODES


The Am29325 data path can be configured in one of three I/O
modes: a 32-bit, two-input-bus mode; a 32-bit, single-input-bus
mode; and a 16-bit, two-input-bus mode. These modes affect
only the manner in which data is delivered to and taken from the
Am29325; operation of the floating-point ALU is not altered. The
I/O mode is selected with the ONEBUS and S16/32 controls. The
I/O Mode Selection Table lists the control codes needed to
invoke each I/O mode.


I/O MODE SELECTION TABLE


S16/32   ONEBUS   I/O Mode
0        0        32-bit, two-input-bus mode
0        1        32-bit, single-input-bus mode(*)
1        0        16-bit, two-input-bus mode(*)
1        1        Illegal I/O mode selection value


(*)FT0 must be held LOW in this mode (see text).

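The selection table can be expressed as a simple lookup. The function names below are hypothetical and serve only to restate the table, including the restriction on FT0.

```python
def io_mode(s16_32, onebus):
    """Return the I/O mode name for the given S16/32 and ONEBUS levels (0 or 1)."""
    modes = {
        (0, 0): "32-bit, two-input-bus mode",
        (0, 1): "32-bit, single-input-bus mode",
        (1, 0): "16-bit, two-input-bus mode",
    }
    if (s16_32, onebus) not in modes:
        raise ValueError("illegal I/O mode selection value")
    return modes[(s16_32, onebus)]


def ft0_must_be_low(s16_32, onebus):
    """True for the two modes in which registers R and S may not be transparent."""
    return (s16_32, onebus) in {(0, 1), (1, 0)}


print(io_mode(0, 0), ft0_must_be_low(0, 0))
print(io_mode(1, 0), ft0_must_be_low(1, 0))
```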
32-Bit, Two-Input-Bus Mode 


In this I/O mode, the R and S buses are configured as independent
32-bit input buses, and the F bus is configured as a 32-bit
output bus. Figure 2 is a functional block diagram of the Am29325
in this I/O mode.

R and S operands are taken from their respective input buses and
clocked into the R and S registers on the LOW-to-HIGH transition
of CLK. Register F is also clocked on the LOW-to-HIGH transition
of CLK. Figure 5(a.) depicts typical I/O timing in this mode.


32-Bit, Single-Input-Bus Mode 


In this I/O mode, the R and S buses are connected to a single
32-bit multiplexed input data bus; the F bus is configured as an
independent 32-bit output bus. Figure 3 is a functional block
diagram of the Am29325 in this I/O mode. Note that both the R
and S bus lines must be wired to the input bus.


R and S operands are multiplexed onto the input bus by the host 
system. The S operand is clocked from the input bus into a 
temporary holding register on the HIGH-to-LOW transition of 
CLK and is transferred to register S on the LOW-to-HIGH transi- 
tion of CLK. The R operand is clocked from the input bus into 
register R on the LOW-to-HIGH transition of CLK. Register F is 
clocked on the LOW-to-HIGH transition of CLK. Figure 5(b.) 
depicts typical I/O timing in this mode. 


When placed in this I/O mode, the data path will not function 
properly if the R and S registers are made transparent. Therefore 
input register feedthrough control FT0 must be held LOW in this
mode. 


Figure 2. Functional Block Diagram for the 32-Bit, Two-Input-Bus Mode 
Figure 3. Functional Block Diagram for the 32-Bit, Single-Input-Bus Mode


16-Bit, Two-Input-Bus Mode


In this I/O mode, the R and S buses are configured as independent
16-bit input buses, and the F bus is configured as a 16-bit
output bus. Figure 4 is a functional block diagram of the Am29325
in this I/O mode. Note that the 16 LSBs and 16 MSBs of the R, S
and F buses must be wired to their respective system buses in
parallel.


Thirty-two-bit operands are passed along the 16-bit data buses 
by time-multiplexing the 16 LSBs and 16 MSBs of each 32-bit 
word. For the R input bus, the host system multiplexes the 16 
LSBs and 16 MSBs of the R operand onto the 16-bit R bus. The 16 
LSBs of the R operand are stored in a temporary holding register
on the HIGH-to-LOW transition of CLK. The 16 MSBs are clocked
into register R on the LOW-to-HIGH transition of CLK; at the
same time, the 16 LSBs are transferred from the temporary
holding register to register R. Transfer of data from the S input bus
to the S register takes place in a similar fashion. Register F is 
clocked on the LOW-to-HIGH transition of CLK. Circuitry internal 
to the Am29325 multiplexes data from register F onto the 16-bit 
output bus by enabling the 16 LSBs of the F output bus when CLK 
is HIGH, and enabling the 16 MSBs of the F output bus when CLK 
is LOW. Figure 5(c.) depicts typical I/O timing in this mode. 
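The time-multiplexing of a 32-bit operand over a 16-bit bus can be sketched as follows. The class `InputPort16` and the function `split_word` are hypothetical names, and the sketch covers only the input side (holding register plus register R or S), not the output multiplexing or any timing detail.

```python
def split_word(word):
    """Split a 32-bit word into (16 LSBs, 16 MSBs), the order in which the
    host system presents them within one clock cycle."""
    return word & 0xFFFF, (word >> 16) & 0xFFFF


class InputPort16:
    """Input register fed from a 16-bit bus through a holding register."""

    def __init__(self):
        self.holding = 0    # temporary holding register for the 16 LSBs
        self.register = 0   # full 32-bit input register

    def falling_edge(self, bus):
        """HIGH-to-LOW transition of CLK: capture the 16 LSBs."""
        self.holding = bus & 0xFFFF

    def rising_edge(self, bus):
        """LOW-to-HIGH transition of CLK: load the 16 MSBs from the bus
        together with the held 16 LSBs."""
        self.register = ((bus & 0xFFFF) << 16) | self.holding


port_r = InputPort16()
lsbs, msbs = split_word(0x40490FDB)
port_r.falling_edge(lsbs)
port_r.rising_edge(msbs)
print(hex(port_r.register))  # 0x40490fdb
```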


When placed in this I/O mode, the data path will not function
properly if the R and S registers are made transparent. Therefore
input register feedthrough control FT0 must be held LOW in this
mode. Caution must also be taken in controlling the register R
input multiplexer control line, I4, in this I/O mode. I4 should be
changed only when CLK is HIGH, in addition to meeting the setup
and hold time requirements given in the Switching Characteristics
section.
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OPERATION IN IEEE MODE 


When input signal IEEE/DEC is HIGH, the IEEE mode of operation
is selected. In this mode the Am29325 uses the floating-point
format set forth in the IEEE Proposed Standard for Binary 
Floating-Point Arithmetic, P754. In addition, the IEEE mode 
complies with most other aspects of single-precision floating- 
point operation outlined in the proposed standard — differences 
are discussed in Appendix A. 


IEEE Floating-Point Format 


The IEEE single-precision floating-point word is thirty-two bits 
wide, and is arranged in the format shown in Figure 6. The 
floating-point word is divided into three fields: a single-bit sign, 
an eight-bit biased exponent, and a 23-bit fraction. 


The sign bit indicates the sign of the floating-point number's 
value. Non-negative values have a sign of 0; negative values, a 
sign of 1. The value zero may have either sign. 


The biased exponent is an eight-bit unsigned integer field representing
a multiplicative factor of some power of two. The bias
value is 127. If, for example, the multiplicative factor for a
floating-point number is to be 2^a, the value of the biased exponent
would be a + 127; a is called the true exponent.


The fraction is a 23-bit unsigned fractional field containing the 23
least-significant bits of the floating-point number's 24-bit mantissa.
The weight of the fraction's most-significant bit is 2^-1; the
weight of the least-significant bit is 2^-23.
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Figure 4. Functional Block Diagram for the 16-Bit, Two-Input-Bus Mode 
A floating-point number is evaluated or interpreted per the following
conventions:


let s = sign bit
    e = biased exponent
    f = fraction


if e = 0 and f = 0 ....... value = (-1)^s * (0)   (+0, -0)
if e = 0 and f ≠ 0 ....... value = denormalized number
if 0 < e < 255 ........... value = (-1)^s * (2^(e-127)) * (1.f)
                           (normalized number)
if e = 255 and f = 0 ..... value = (-1)^s * (∞)   (+∞, -∞)
if e = 255 and f ≠ 0 ..... value = not-a-number (NAN)
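
The conventions above translate directly into a field decoder. The following is a minimal sketch for illustration; the function names `classify` and `value` are hypothetical and do not correspond to any Am29325 signal or operation.

```python
def classify(word):
    """Classify a 32-bit IEEE-mode word by its exponent and fraction fields."""
    e = (word >> 23) & 0xFF      # eight-bit biased exponent
    f = word & 0x7FFFFF          # 23-bit fraction
    if e == 0:
        return "zero" if f == 0 else "denormalized"
    if e == 255:
        return "infinity" if f == 0 else "NAN"
    return "normalized"


def value(word):
    """Numeric value of a zero or normalized word, per the conventions above."""
    s = (word >> 31) & 1
    e = (word >> 23) & 0xFF
    f = word & 0x7FFFFF
    kind = classify(word)
    if kind == "zero":
        return -0.0 if s else 0.0
    if kind == "normalized":
        # (-1)^s * 2^(e-127) * 1.f, with the implied leading 1 restored
        return (-1) ** s * 2.0 ** (e - 127) * (1 + f / 2 ** 23)
    raise ValueError("no single numeric value for a " + kind)
```

For instance, `value(0x40600000)` evaluates to 3.5 and `classify(0x7FA00000)` reports a NAN.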


Zero — The value zero can have either a positive or negative sign. 
Rules for determining the sign of a zero produced by an operation 
are given in the Sign Bit section on page 12. 


Denormalized Number — A denormalized number represents a
quantity with magnitude less than 2^-126 but greater than zero.


Normalized Number — A normalized number represents a
quantity with magnitude greater than or equal to 2^-126 but less
than 2^128.


Example 1: 


The number +3.5 can be represented in floating-point format 
as follows: 


+3.5 = 11.1_2 x 2^0
     = 1.11_2 x 2^1


sign = 0


biased exponent = 1_10 + 127_10 = 128_10
                = 10000000_2


fraction = 11000000000000000000000_2
(the leading 1 is implied in the format)


Concatenating these fields produces the floating-point word
40600000_16.
Figure 5. Typical Bus Timing for the I/O Modes, with FT0 = LOW, FT1 = LOW
a) 32-Bit, Two-Input-Bus Mode
b) 32-Bit, Single-Input-Bus Mode
c) 16-Bit, Two-Input-Bus Mode
Figure 6. IEEE Mode Single-Precision Floating-Point Format
(Bit layout: sign bit (S), biased exponent (E), fraction (F); the
fraction bits carry weights 2^-1 through 2^-23.)

VALUE = (-1)^S (2^(E-127)) (1.F)


Example 2: 


The number — 11.375 can be represented in floating-point for- 
mat as follows: 


-11.375 = -1011.011_2 x 2^0
        = -1.011011_2 x 2^3


sign = 1


biased exponent = 3_10 + 127_10 = 130_10
                = 10000010_2


fraction = 01101100000000000000000_2
(the leading 1 is implied in the format)


Concatenating these fields produces the floating-point word
C1360000_16.
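
The field concatenation in Examples 1 and 2 can be checked mechanically. The sketch below uses a hypothetical helper, `pack_fields`, and cross-checks the first word against the host's own IEEE single-precision encoding through Python's standard `struct` module; it says nothing about how the Am29325 itself assembles a result.

```python
import struct


def pack_fields(sign, biased_exponent, fraction):
    """Concatenate sign (1 bit), biased exponent (8 bits), fraction (23 bits)."""
    return (sign << 31) | (biased_exponent << 23) | fraction


# Example 1: +3.5 = 1.11_2 x 2^1, so the fraction is 11 followed by 21 zeros.
word_3_5 = pack_fields(0, 1 + 127, 0b11 << 21)

# Example 2: -11.375 = -1.011011_2 x 2^3, fraction is 011011 then 17 zeros.
word_m11_375 = pack_fields(1, 3 + 127, 0b011011 << 17)

# Cross-check Example 1 against the host's big-endian single-precision bytes.
host_3_5 = struct.unpack(">I", struct.pack(">f", 3.5))[0]

print(f"{word_3_5:08X} {word_m11_375:08X}")
```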


Infinity — Infinity can have either a positive or negative sign. The 
way in which infinities are interpreted is determined by the state of 
the projective/affine mode select, PROJ/AFF. 
Not-a-Number — A not-a-number, or NAN, does not represent a
numeric value, but is interpreted as a signal or symbol. NANs are 
used to indicate invalid operations, and as a means of passing 
process status information through a series of calculations. NANs 
arise in two ways: they can be generated by the Am29325 to 
indicate that an invalid operation has taken place (e.g., infinity 
times zero), or they can be provided by the user as an input 
operand. There are two types of NANs: signalling and quiet. 
These NANs have the formats shown in Figure 7. 


IEEE Mode Integer Format 


Integer numbers are represented as 32-bit, two's complement
words; Figure 8 depicts the integer format. The integer word can
represent a range of integer values from -2^31 to 2^31 - 1.


Figure 7. Signalling and Quiet NAN Formats
(Bit layouts of the signalling NAN and quiet NAN words, bits 31
through 0, showing the sign bit, biased exponent and fraction fields.)

X = DON'T CARE

AT LEAST ONE OF THE TWENTY-TWO LSBs OF A QUIET NAN MUST BE 1


Figure 8. Thirty-Two-Bit Integer Format
(Bit numbers 31 through 0, with bit weights -2^31, 2^30, 2^29, 2^28,
2^27, and so on.)
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Operations 


All eight floating-point ALU operations discussed in the Func- 
tional Description section above can be performed in IEEE mode. 
Various exceptional aspects of the R PLUS S, R MINUS S, R 
TIMES S, 2 MINUS S, INT-TO-FP, and FP-TO-INT operations 
for this mode are described below. The IEEE-TO-DEC and
DEC-TO-IEEE operations are discussed separately in the 
IEEE-TO-DEC and DEC-TO-IEEE Operations section on 
page 23. 


Operations with NANs — NANs arise in two ways: they can be 
generated by the Am29325 to indicate that an invalid operation 
has taken place (e.g., infinity times zero), or they can be provided 
by the user as an input operand. There are two types of 
NANs: signalling and quiet. These NANs have the formats 
shown in Figure 7. 


Signalling NANs set the invalid operation flag when they appear 
as an input operand to an operation. They are useful for indicating 
uninitialized variables, or for implementing user-designed exten- 
sions to the operations provided. The ALU never produces a 
signalling NAN as the final result of an operation. 


Quiet NANs are generated for invalid operations. When they 
appear as an input operand, they are passed through most oper- 
ations without setting the invalid flag, the floating-point-to- 
integer conversion operation being the exception. 


The sign of any input operand NAN is ignored. All quiet NANs 
produced as the final result of an operation have a sign of 0. 


When a NAN appears as an input operand, the final result of the 
operation is a quiet NAN that is created by taking the input NAN 
and forcing bit 22 LOW and bit 21 HIGH. If an operation has two 
NANs as input operands, the resulting quiet NAN is created using 
the NAN on the R port. 


When a quiet NAN is produced as the final result of an invalid
operation whose input operand or operands are not NANs, the
resulting NAN will always have the value 7FA00000_16.


The NAN flag will be HIGH whenever an operation produces a 
NAN as a final result. 


Example 1: 


Suppose the floating-point addition operation is performed 
with the following input operands: 


R port: 3F800000_16 (1.0 x 2^0)
S port: 7FC12345_16 (signalling NAN)


Result: The signalling NAN on the S port is converted to
a quiet NAN by forcing bit 22 LOW and bit 21 HIGH.
The operation's final result will be 7FA12345_16. Since
one of the two input operands is a signalling NAN,
the invalid flag will be HIGH; the NAN flag will also
be HIGH.


Example 2: 
Suppose the floating-point multiplication operation is per- 
formed with the following input operands: 


R port: FFF11111_16 (signalling NAN)
S port: 7FC22222_16 (quiet NAN)


Result: Since both input operands are NANs, the NAN on the
R port is chosen for output. In addition to forcing bit 22
LOW, the sign bit (bit 31) is set LOW (bit 21 is already
HIGH, and need not be changed). The operation's final
result will be 7FB11111_16. Since one of the two input
operands is a signalling NAN, the invalid flag is HIGH;
the NAN flag will also be HIGH.


Example 3: 


Suppose the floating-point subtraction operation is performed 
with the following input operands: 


R port: FF800001_16 (quiet NAN)
S port: 7F800000_16 (+∞)


Result: To create the final result, the quiet NAN's sign bit (bit
31) is forced LOW and bit 21 is forced HIGH (bit 22 is
already LOW, and need not be changed). The final
result will be 7FA00001_16. The NAN flag will be HIGH.
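
The NAN-handling rule used in the three examples above (sign forced to 0, bit 22 forced LOW, bit 21 forced HIGH, R port preferred when both operands are NANs) can be written as a few bit operations. This is an illustrative sketch only; `is_nan`, `quiet` and `nan_result` are hypothetical names.

```python
EXP_MASK = 0x7F800000   # biased exponent field, all ones for NANs and infinities
FRAC_MASK = 0x007FFFFF  # 23-bit fraction field


def is_nan(word):
    """A NAN has a biased exponent of 255 and a non-zero fraction."""
    return (word & EXP_MASK) == EXP_MASK and (word & FRAC_MASK) != 0


def quiet(nan_word):
    """Build the quiet NAN output from an input NAN."""
    w = nan_word & 0x7FFFFFFF  # sign of the output NAN is 0
    w &= ~(1 << 22)            # force bit 22 LOW
    w |= 1 << 21               # force bit 21 HIGH
    return w


def nan_result(r, s):
    """Quiet NAN produced when R and/or S is a NAN; None if neither is."""
    if is_nan(r):              # the R port takes priority
        return quiet(r)
    if is_nan(s):
        return quiet(s)
    return None
```

Applied to the operands of Examples 1 through 3, `nan_result` returns 7FA12345_16, 7FB11111_16 and 7FA00001_16 respectively.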


Operations with Denormalized Numbers — The proposed 
IEEE standard incorporates denormalized numbers to allow a
means of gradual underflow for operations that produce non-zero 
results too small to be expressed as a normalized floating-point 
number. The Am29325 does not support gradual underflow. If a 
floating-point operation produces a non-zero rounded result that 
is not large enough to be expressed as a normalized floating- 
point number, the final result will be a zero of the same sign; the 
inexact, underflow, and zero flags will be HIGH. If an input 
operand is a denormalized number, the floating-point ALU will 
assume that operand to be a zero of the same sign. 


Operations Producing Overflows — If an operation has a finite 
input operand or operands, and if the operation produces a 
rounded result that is too large to fit in the destination format, that 
operation is said to have overflowed. 


A floating-point overflow occurs if an R PLUS S, R MINUS S, R 
TIMES S, or 2 MINUS S operation with finite input operand(s) 
produces a result which, after rounding, has a magnitude greater 
than or equal to 2^128. Positive or negative infinity will appear as
the final result if the rounded result is positive or negative, respec- 
tively, and the overflow and inexact flags will be HIGH. 


Integer overflow occurs when the floating-point-to-integer conversion
operation attempts to convert a number which, after rounding,
is greater than 2^31 - 1 or less than -2^31. The final result will
be quiet NAN 7FA00000_16, and the invalid operation and NAN
flags will be HIGH. Note that the overflow and inexact flags
remain LOW for integer overflow.


Operations Producing Underflows — If an operation produces 
a floating-point rounded result having a magnitude too small to be 
expressed as a normalized floating-point number, but greater 
than zero, that operation is said to have underflowed. Underflow 
occurs when an R PLUS S, R MINUS S, or R TIMES S operation 
produces a result which, after rounding, has a magnitude in the 
range: 


0 < magnitude < 2^-126.


In such cases, the final result will be +0 (00000000_16) if the
rounded result is non-negative, and -0 (80000000_16) if the
rounded result is negative. The underflow, inexact, and zero flags
will be HIGH.


Underflow does not occur if the destination format is integer. If the 
infinitely precise result of a floating-point-to-integer conversion 
has a magnitude greater than 0 and less than 1 but the rounded 
result is 0, the underflow flag remains LOW. 
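
The floating-point overflow and underflow rules above amount to a range check on the rounded result. The sketch below models that check with exact rationals; `range_check` is a hypothetical name, the string results stand in for the output words, and the rounding step itself is assumed to have been done already.

```python
from fractions import Fraction

OVERFLOW_BOUND = Fraction(2) ** 128   # magnitudes at or above this overflow
MIN_NORMAL = Fraction(2) ** -126      # smallest normalized magnitude


def range_check(rounded):
    """Map a rounded result (exact Fraction) to (final result, flags set HIGH)."""
    mag = abs(rounded)
    if mag >= OVERFLOW_BOUND:
        final = "+inf" if rounded > 0 else "-inf"
        return final, {"overflow", "inexact"}
    if 0 < mag < MIN_NORMAL:
        # No gradual underflow: the result becomes a zero of the same sign.
        final = "+0" if rounded > 0 else "-0"
        return final, {"underflow", "inexact", "zero"}
    return rounded, set()
```

A rounded result of exactly 2^128 therefore yields positive infinity with the overflow and inexact flags, while -2^-127 yields -0 with the underflow, inexact and zero flags.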


Operations with Infinities — In most cases, positive and nega- 
tive infinity are valid input arguments for the R PLUS S, R MINUS 
S, R TIMES S, and 2 MINUS S operations. Those cases for which
infinities are not valid inputs for these operations are listed in the 
IEEE Mode Invalid Operations Table (see next page). 


Infinities in IEEE mode can be handled either as projective or
affine. The projective mode is selected when PROJ/AFF is HIGH;
the affine mode is selected when PROJ/AFF is LOW. The only
differences between the modes that are relevant to Am29325
operation occur during the addition and subtraction of infinities:


Operation      Affine Mode    Projective Mode

(+∞) + (+∞)    Output +∞      Output 7FA00000_16 (quiet NAN),
                              set invalid and NAN flags

(-∞) + (-∞)    Output -∞      Output 7FA00000_16 (quiet NAN),
                              set invalid and NAN flags

(+∞) - (-∞)    Output +∞      Output 7FA00000_16 (quiet NAN),
                              set invalid and NAN flags

(-∞) - (+∞)    Output -∞      Output 7FA00000_16 (quiet NAN),
                              set invalid and NAN flags

If an R PLUS S, R MINUS S, R TIMES S, or 2 MINUS S operation
has infinity as an input operand or operands, the final result, if
valid, is presumed to be exact. For example, adding +∞ and 2.0
will produce a final result of +∞; since the result is considered
exact, the inexact flag remains LOW.
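
The affine and projective treatment of infinities in addition and subtraction can be summarized in a short sketch. The function names are hypothetical, and modeling subtraction as addition with the sign of S inverted is an assumption made for illustration rather than a statement about the device's internals.

```python
POS_INF = 0x7F800000
NEG_INF = 0xFF800000
QUIET_NAN = 0x7FA00000


def add_infinities(r, s, projective):
    """R PLUS S where both operands are infinities: (result word, flags HIGH)."""
    if r != s or projective:
        # Opposite-signed infinities are invalid in either mode; like-signed
        # infinities are invalid in projective mode only.
        return QUIET_NAN, {"invalid", "NAN"}
    return r, set()  # affine mode, like signs: exact infinity of that sign


def sub_infinities(r, s, projective):
    """R MINUS S, modeled here as R PLUS (S with its sign bit inverted)."""
    return add_infinities(r, s ^ 0x80000000, projective)
```

With `projective=False`, `add_infinities(POS_INF, POS_INF, False)` returns positive infinity with no flags; with `projective=True` the same operands return the quiet NAN with the invalid and NAN flags.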


Invalid Operations — If an input operand is invalid for the opera- 
tion to be performed, that operation is considered invalid. When 
an invalid operation is performed, the floating-point ALU pro- 
duces a quiet NAN as the final result, and the invalid operation 
flag goes HIGH. The IEEE Mode Invalid Operations Table lists 
the cases for which the invalid flag is HIGH in IEEE mode, and the 
final results produced for these operations. 


IEEE MODE INVALID OPERATIONS TABLE

Operation    Input Operand                          Final Result

R PLUS S     (+∞) + (-∞)                            7FA00000_16
             or (-∞) + (+∞)                         (quiet NAN)

R PLUS S     (+∞) + (+∞)                            7FA00000_16
             or (-∞) + (-∞) (Note 1)                (quiet NAN)

R MINUS S    (+∞) - (+∞)                            7FA00000_16
             or (-∞) - (-∞)                         (quiet NAN)

R MINUS S    (+∞) - (-∞)                            7FA00000_16
             or (-∞) - (+∞) (Note 1)                (quiet NAN)

R TIMES S    (+0) * (+∞)                            7FA00000_16
             or (+0) * (-∞)                         (quiet NAN)
             or (-0) * (+∞)
             or (-0) * (-∞)

R PLUS S     R or S is a signalling NAN             (Note 2)
R MINUS S
R TIMES S

2 MINUS S    S is a signalling NAN                  (Note 2)

FP-TO-INT    R is a signalling or quiet NAN         (Note 2)

FP-TO-INT    R > 2^31 - 1                           7FA00000_16
             or R < -(2^31)                         (quiet NAN)

Notes: 1. These cases are invalid in projective mode only.
       2. Results for these operations are described in the Operations
          with NANs section.


The Sign Bit 


For most floating-point operations, the sign bit of the final result is 
unambiguous, i.e., there is only one sign bit value that yields a 
numerically correct result. Operations that produce an infinitely 
precise result of zero, however, present a problem, as the IEEE
floating-point format allows for representation of both +0 and 
—0. The following rules can be used to determine the signs of 
zero produced in such cases: 


R PLUS S — The operations +x + (-x) and -x + (+x) produce a
final result of zero; the sign of the zero is dependent on the
rounding mode:


Rounding Mode        Sign of Final Result

Round to nearest     +
Round toward -∞      -
Round toward +∞      +
Round toward 0       +


The operation +0 + (+0) produces a final result of +0; the
operation -0 + (-0) produces a final result of -0.


R MINUS S — The operations +x - (+x) and -x - (-x) produce
a final result of zero; the sign of the zero is dependent on the
rounding mode:


Rounding Mode        Sign of Result

Round to nearest     +
Round toward -∞      -
Round toward +∞      +
Round toward 0       +


The operation +0 - (-0) produces a final result of +0; the
operation -0 - (+0) produces a final result of -0.


R TIMES S — The sign of any multiplication result other than a 
NAN is the exclusive-OR of the signs of the input operands. 
Therefore, if x is non-negative, 


+0 times +x produces a final result of +0,
+0 times -x produces a final result of -0,
-0 times +x produces a final result of -0,
-0 times -x produces a final result of +0.


2 MINUS S — If S equals 2, the final result is -0 for the round
toward -∞ mode, and +0 for all other rounding modes.
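
Two of the sign rules above are simple enough to state as code. This sketch covers only the R TIMES S rule and the 2 MINUS S rule as they are worded in the text; the function names and the mode strings are hypothetical.

```python
def product_sign(sign_r, sign_s):
    """Sign bit of a non-NAN product: exclusive-OR of the operand signs."""
    return sign_r ^ sign_s


def two_minus_s_zero_sign(rounding_mode):
    """Sign bit of the zero produced by 2 MINUS S when S equals 2.

    rounding_mode is one of 'nearest', 'toward -inf', 'toward +inf',
    'toward 0' (labels chosen for this sketch).
    """
    return 1 if rounding_mode == "toward -inf" else 0
```

For example, `product_sign(0, 1)` gives 1, matching "+0 times -x produces a final result of -0".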


Rounding 


Rounding is performed whenever an operation produces an infi- 
nitely precise result that cannot be represented exactly in the 
destination format. For example, suppose a floating-point opera- 
tion produces the infinitely precise result 


1.10101010101010101010101\01 x 2^3.


In this example, the fraction portion of the mantissa has twenty-five
bits; the IEEE floating-point format can accommodate only
twenty-three. The backslash (\) in the mantissa represents the
boundary between the first twenty-three bits of the fraction and
any remaining bits. Rounding is the process by which this result is
approximated by a representation that fits the destination format.
There are four rounding modes in IEEE mode: round to nearest,
round toward +∞, round toward -∞, and round toward 0. The
rounding mode is chosen using the rounding mode select lines,
RND0 and RND1. The Rounding Mode Select Table lists the
select states needed to obtain the desired rounding mode.


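ROUNDING MODE SELECT TABLE

(The RND1 and RND0 select states for each row are not legible in
this copy; the rows appear in the following order.)

Rounding Mode

Round to nearest
Round toward -∞
Round toward +∞
Round toward 0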


Round to Nearest — In this rounding mode the infinitely precise 
result of an operation is rounded to the closest representation that 
fits in the destination format. If the infinitely precise result is 
exactly halfway between two representations, it is rounded to the 
representation having an LSB of zero. Rounding is performed 
both for floating-point and integer destination formats. 
Figure 9 illustrates four examples of the round to nearest process 
for operations having a floating-point destination format. The 
infinitely precise result of an operation is represented by an X on 
the number line; the black dots on the number line indicate 
those values that can be represented exactly in the floating-point 
format. 
Example 1:

In Figure 9(a), the infinitely precise result of an operation is:

2^20 + 2^-4 + 2^-5 = 1.00000000000000000000000\11 x 2^20.

The result is rounded to the closest representable floating-point
value,

2^20 + 2^-3 = 1.00000000000000000000001 x 2^20.


Example 2:

In Figure 9(b), the infinitely precise result of an operation is:

2^20 - 2^-4 + 2^-8 = 1.11111111111111111111111\0001 x 2^19.

This result is rounded to the closest representable floating-point
value,

2^20 - 2^-4 = 1.11111111111111111111111 x 2^19.


Example 3:

In Figure 9(c), the infinitely precise result of an operation is:

-(2^20 + 2^-3 + 2^-4)
= -1.00000000000000000000001\1 x 2^20.

This result is exactly halfway between two representable
floating-point values. Accordingly, it is rounded to the closest
representation with an LSB of zero, or

-(2^20 + 2*2^-3) = -1.00000000000000000000010 x 2^20.


Example 4:

In Figure 9(d), the infinitely precise result of an operation is:

2^20 + 3*2^-3 = 1.00000000000000000000011 x 2^20.

This result can be represented exactly in the floating-point
format, and is left unaltered by the rounding process.
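
The floating-point rounding examples can be reproduced with exact rational arithmetic. The sketch below rounds a value to a 24-bit mantissa in any of the four IEEE-mode rounding modes; the function name and mode strings are hypothetical, and exponent range limits (overflow and underflow) are deliberately ignored.

```python
from fractions import Fraction


def round_mantissa(x, mode):
    """Round exact value x (a Fraction) to a 24-bit mantissa.

    mode: 'nearest', 'toward -inf', 'toward +inf' or 'toward 0'.
    """
    if x == 0:
        return x
    negative = x < 0
    mag = abs(x)
    # Find e such that 2^e <= mag < 2^(e+1).
    e = mag.numerator.bit_length() - mag.denominator.bit_length()
    if Fraction(2) ** e > mag:
        e -= 1
    ulp = Fraction(2) ** (e - 23)        # weight of the fraction's LSB
    q = mag / ulp
    low = q.numerator // q.denominator   # mantissa truncated to 24 bits
    rem = q - low                        # discarded part, in units of one LSB
    if rem == 0:
        round_up = False
    elif mode == "nearest":
        half = Fraction(1, 2)
        round_up = rem > half or (rem == half and low % 2 == 1)
    elif mode == "toward 0":
        round_up = False
    elif mode == "toward +inf":
        round_up = not negative
    elif mode == "toward -inf":
        round_up = negative
    else:
        raise ValueError(mode)
    result = (low + round_up) * ulp
    return -result if negative else result
```

Calling `round_mantissa(-(2**20 + Fraction(1, 8) + Fraction(1, 16)), "nearest")` reproduces the halfway case of Example 3, returning -(2^20 + 2*2^-3).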


Figure 9. Floating-Point Rounding Examples for Round to Nearest Mode
(Number-line diagrams for Examples 1 through 4: a) round to
2^20 + 2^-3; b) round to 2^20 - 2^-4; c) round to -(2^20 + 2*2^-3);
d) no change.)
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Figure 10 illustrates four examples of the round to nearest 
process for operations having an integer destination format. The 
infinitely precise result of an operation is represented by an X on 
the number line; the black dots on the number line indicate those 
values that can be represented exactly in the integer format. 


Example 1:

In Figure 10(a), the infinitely precise result of an operation is:

2^10 - 2^-2 = 00...001111111111.11.

The result is rounded to the closest representable integer value,

2^10 = 00...010000000000.


Example 2:

In Figure 10(b), the infinitely precise result of an operation is:

2^10 + 2^0 + 2^-3 = 00...010000000001.001.

This result is rounded to the closest representable integer
value,

2^10 + 2^0 = 00...010000000001.
Example 3:

In Figure 10(c), the infinitely precise result of an operation is:

-(2^10 + 2^0 + 2^-1) = 11...101111111110.1.

This result is exactly halfway between two representable integer
values. Accordingly, it is rounded to the closest representation
with an LSB of zero, or

-(2^10 + 2*2^0) = 11...101111111110.


Example 4:

In Figure 10(d), the infinitely precise result of an operation is:

2^10 + 3*2^0 = 00...010000000011.

This result can be represented exactly in the integer format, and
is left unaltered by the rounding process.


Figure 10. Integer Rounding Examples for Round to Nearest Mode
(Number-line diagrams for Examples 1 through 4: a) round to 2^10;
b) round to 2^10 + 1; c) round to -(2^10 + 2); d) no change.)
Round Toward -∞ — In this rounding mode the result of an
operation is rounded to the closest representation that is less than
or equal to the infinitely precise result, and which fits the destination
format. Rounding is performed both for floating-point and
integer destination formats.


Figure 11 illustrates four examples of the round toward -∞ process
for operations having a floating-point destination format.
The infinitely precise result of an operation is represented by an X
on the number line; the black dots on the number line indicate
those values that can be represented exactly in the floating-point
format.


Example 1:

In Figure 11(a), the infinitely precise result of an operation is:

2^20 + 2^-4 + 2^-5 = 1.00000000000000000000000\11 x 2^20.

This result cannot be represented exactly in floating-point
format, and is rounded to the next-smaller floating-point
representation:

2^20 = 1.00000000000000000000000 x 2^20.


Example 2:

In Figure 11(b), the infinitely precise result of an operation is:

2^20 - 2^-4 + 2^-8 = 1.11111111111111111111111\0001 x 2^19.

This result cannot be represented exactly in floating-point
format, and is rounded to the next-smaller floating-point
representation:

2^20 - 2^-4 = 1.11111111111111111111111 x 2^19.


Example 3:

In Figure 11(c), the infinitely precise result of an operation is:

-(2^20 + 2^-3 + 2^-4)
= -1.00000000000000000000001\1 x 2^20.

This result cannot be represented exactly in floating-point
format, and is rounded to the next-smaller floating-point
representation:

-(2^20 + 2*2^-3) = -1.00000000000000000000010 x 2^20.


Example 4:

In Figure 11(d), the infinitely precise result of an operation is:

2^20 + 3*2^-3 = 1.00000000000000000000011 x 2^20.

This result can be represented exactly in the floating-point
format, and is left unaltered by the rounding process.





Figure 11. Floating-Point Rounding Examples for Round Toward -∞ Mode
(Number-line diagrams for Examples 1 through 4: a) round to 2^20;
b) round to 2^20 - 2^-4; c) round to -(2^20 + 2*2^-3); d) no change.)




Figure 12 illustrates four examples of the round toward -∞ process
for operations having an integer destination format. The
infinitely precise result of an operation is represented by an X on
the number line; the black dots on the number line indicate those
values that can be exactly represented in the integer format.


Example 1:

In Figure 12(a), the infinitely precise result of an operation is:

2^10 - 2^-2 = 00...001111111111.11.

The result is rounded to the next-smaller representable integer
value,

2^10 - 2^0 = 00...001111111111.


Example 2:

In Figure 12(b), the infinitely precise result of an operation is:

2^10 + 2^0 + 2^-3 = 00...010000000001.001.

This result is rounded to the next-smaller representable integer
value,

2^10 + 2^0 = 00...010000000001.


Example 3:

In Figure 12(c), the infinitely precise result of an operation is:

-(2^10 + 2^0 + 2^-1) = 11...101111111110.1.

This result is rounded to the next-smaller representable integer
value:

-(2^10 + 2*2^0) = 11...101111111110.


Example 4:

In Figure 12(d), the infinitely precise result of an operation is:

2^10 + 3*2^0 = 00...010000000011.

This result can be represented exactly in the integer format, and
is unaltered by the rounding process.





Figure 12. Integer Rounding Examples for Round Toward -∞ Mode
Round Toward +∞ — In this rounding mode the result of an
operation is rounded to the closest representation that is greater
than or equal to the infinitely precise result, and which fits the
destination format. Rounding is performed both for floating-point
and integer destination formats.


Figure 13 illustrates four examples of the round toward +∞
process for operations having a floating-point destination
format. The infinitely precise result of an operation is represented
by an X on the number line; the black dots on the number line
indicate those values that can be represented exactly in the
floating-point format.

Example 1:

In Figure 13(a), the infinitely precise result of an operation is:

2^20 + 2^-4 + 2^-5 = 1.00000000000000000000000\11 x 2^20.

This result cannot be represented exactly in floating-point
format, and is rounded to the next-larger floating-point
representation:

2^20 + 2^-3 = 1.00000000000000000000001 x 2^20.


Example 2:

In Figure 13(b), the infinitely precise result of an operation is:

2^20 - 2^-4 + 2^-8 = 1.11111111111111111111111\0001 x 2^19.

This result cannot be represented exactly in floating-point
format, and is rounded to the next-larger floating-point
representation:

2^20 = 1.00000000000000000000000 x 2^20.


Example 3:

In Figure 13(c), the infinitely precise result of an operation is:

-(2^20 + 2^-3 + 2^-4)
= -1.00000000000000000000001\1 x 2^20.

This result cannot be represented exactly in floating-point
format, and is rounded to the next-larger floating-point
representation:

-(2^20 + 2^-3) = -1.00000000000000000000001 x 2^20.


Example 4:

In Figure 13(d), the infinitely precise result of an operation is:

2^20 + 3*2^-3 = 1.00000000000000000000011 x 2^20.

This result can be represented exactly in the floating-point
format; no rounding takes place.





Figure 13. Floating-Point Rounding Examples for Round Toward +∞ Mode


Figure 14 illustrates four examples of the round toward +∞ process
for operations having an integer destination format. The
infinitely precise result of an operation is represented by an X on
the number line; the black dots on the number line indicate those
values that can be exactly represented in the integer format.


Example 1:

In Figure 14(a), the infinitely precise result of an operation is:

2^10 - 2^-2 = 00...001111111111.11.

The result is rounded to the next-larger representable integer
value,

2^10 = 00...010000000000.


Example 2:

In Figure 14(b), the infinitely precise result of an operation is:

2^10 + 2^0 + 2^-3 = 00...010000000001.001.

This result is rounded to the next-larger representable integer
value,

2^10 + 2*2^0 = 00...010000000010.


Example 3:

In Figure 14(c), the infinitely precise result of an operation is:

-(2^10 + 2^0 + 2^-1) = 11...101111111110.1.

This result is rounded to the next-larger representable integer
value:

-(2^10 + 2^0) = 11...101111111111.


Example 4:

In Figure 14(d), the infinitely precise result of an operation is:

2^10 + 3*2^0 = 00...010000000011.

This result can be represented exactly in the integer format;
no rounding takes place.


Figure 14. Integer Rounding Examples for Round Toward +∞ Mode

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Round Toward 0 — In this rounding mode the result of an 
operation is rounded to the closest representation whose mag- 
nitude is less than or equal to the infinitely precise result, and 
which fits the destination format. Rounding is performed both for 
floating-point and integer destination formats. 

Figure 15 illustrates four examples of the round toward 0 process 
for operations having a floating-point destination format. The 
infinitely precise result of an operation is represented by an X on 
the number line; the black dots on the number line indicate those 
values that can be represented exactly in the floating-point 
format. 


Example 1:

In Figure 15(a), the infinitely precise result of an operation is:

2^20 + 2^-4 + 2^-5 = 1.00000000000000000000000\11 x 2^20.

This result cannot be represented exactly in floating-point
format, and is rounded to:

2^20 = 1.00000000000000000000000 x 2^20.


Example 2:

In Figure 15(b), the infinitely precise result of an operation is:

2^20 - 2^-4 + 2^-8 = 1.11111111111111111111111\0001 x 2^19.

This result cannot be represented exactly in floating-point
format, and is rounded to:

2^20 - 2^-4 = 1.11111111111111111111111 x 2^19.


Example 3:

In Figure 15(c), the infinitely precise result of an operation is:

-(2^20 + 2^-3 + 2^-4)
= -1.00000000000000000000001\1 x 2^20.

This result cannot be represented exactly in floating-point
format, and is rounded to:

-(2^20 + 2^-3) = -1.00000000000000000000001 x 2^20.


Example 4:

In Figure 15(d), the infinitely precise result of an operation is:

2^20 + 3*2^-3 = 1.00000000000000000000011 x 2^20.

This result can be represented exactly in the floating-point
format, and is unaffected by the rounding process.


Figure 15. Floating-Point Rounding Examples for Round Toward 0 Mode
(Number-line diagrams for Examples 1 through 4: a) round to 2^20;
b) round to 2^20 - 2^-4; c) round to -(2^20 + 2^-3); d) no change.)

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Figure 16 illustrates four examples of the round toward 0 process 
for operations having an integer destination format. The infinitely 
precise result of an operation is represented by an X on the 
number line; the black dots on the number line indicate those 
values that can be exactly represented in the integer format. 

Example 1:

In Figure 16(a), the infinitely precise result of an operation is:

2^10 - 2^-2 = 00...001111111111.11.

The result is rounded to:

2^10 - 2^0 = 00...001111111111.


Example 2:

In Figure 16(b), the infinitely precise result of an operation is:

2^10 + 2^0 + 2^-3 = 00...010000000001.001.

The result is rounded to:

2^10 + 2^0 = 00...010000000001.


Example 3:

In Figure 16(c), the infinitely precise result of an operation is:

-(2^10 + 2^0 + 2^-1) = 11...101111111110.1.

This result is rounded to:

-(2^10 + 2^0) = 11...101111111111.


Example 4:

In Figure 16(d), the infinitely precise result of an operation is:

2^10 + 3*2^0 = 00...010000000011.

This result can be represented exactly in the integer format, and
is unaffected by the rounding process.
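
The integer-destination examples for all four rounding modes follow one pattern, sketched below with exact rationals. The names `round_to_integer` and `to_word` and the mode strings are hypothetical, and the 32-bit range check (integer overflow) is omitted.

```python
import math
from fractions import Fraction


def round_to_integer(x, mode):
    """Round exact value x (a Fraction) to an integer.

    mode: 'nearest', 'toward -inf', 'toward +inf' or 'toward 0'.
    """
    low = math.floor(x)          # next integer at or below x
    rem = x - low
    if rem == 0:
        return low               # exactly representable: no rounding
    if mode == "toward -inf":
        return low
    if mode == "toward +inf":
        return low + 1
    if mode == "toward 0":
        return low if x > 0 else low + 1
    if mode == "nearest":
        half = Fraction(1, 2)
        if rem != half:
            return low + 1 if rem > half else low
        return low if low % 2 == 0 else low + 1  # tie: pick the LSB-zero value
    raise ValueError(mode)


def to_word(n):
    """32-bit two's complement representation of integer n."""
    return n & 0xFFFFFFFF
```

For the Example 3 operand, -(2^10 + 2^0 + 2^-1), the four modes give -1026 (nearest), -1026 (toward -∞), -1025 (toward +∞) and -1025 (toward 0), in agreement with Figures 10, 12, 14 and 16.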
Flag Operation

The Am29325 generates six status flags to monitor floating-point
processor operation. The following is a summary of flag
conventions in IEEE mode:

Invalid Operation Flag — The invalid operation flag is HIGH
when an input operand is invalid for the operation to be
performed. The IEEE Mode Invalid Operations Table on page 12
lists the cases for which the invalid operation flag is HIGH in IEEE
mode, and the corresponding final result. In cases where the
invalid operation flag is HIGH, the overflow, underflow, zero, and
inexact flags are LOW; the NAN flag will be HIGH.

Overflow Flag — The overflow flag is HIGH if an R PLUS S,
R MINUS S, R TIMES S, or 2 MINUS S operation with finite
input operand(s) produces a result which, after rounding, has a
magnitude greater than or equal to 2^128. The final result will be
+∞ or -∞.

Underflow Flag — The underflow flag is HIGH if an R PLUS S, R
MINUS S, or R TIMES S operation produces a result which, after
rounding, has a magnitude in the range:

0 < magnitude < 2^-126.

The final result will be +0 (00000000₁₆) if the rounded result is
non-negative, and -0 (80000000₁₆) if the rounded result is
negative.


Inexact Flag — The inexact flag is HIGH if the final result of an R 
PLUS S, R MINUS S, R TIMES S, 2 MINUS S, INT-TO-FP, or 
FP-TO-INT operation is not equal to the infinitely precise result. 
Note that if the underflow or overflow flag is HIGH, the inexact flag 
will also be HIGH. 


Figure 16. Integer Rounding Examples for Round Toward 0 Mode

Zero Flag — The zero flag is HIGH if the final result of an
operation is zero. For operations producing an IEEE floating-point
number, the flag accompanies outputs +0 (00000000₁₆)
and -0 (80000000₁₆). For operations producing an integer, the
flag accompanies the output 0 (00000000₁₆).

NAN Flag — The NAN flag is HIGH if an R PLUS S, R MINUS S, R
TIMES S, 2 MINUS S, or FP-TO-INT operation produces a NAN
as a final result.


OPERATION IN DEC MODE 


When input signal IEEE/DEC is LOW, the DEC mode of operation
is selected. In this mode the Am29325 uses the single-precision
floating-point format (floating F) set forth in Digital Equipment
Corporation's VAX Architecture Manual. In addition, the DEC
mode complies with most other aspects of single-precision
floating-point operation outlined in the manual; differences are
discussed in Appendix B.


DEC Floating-Point Format 


The DEC single-precision floating-point word is thirty-two bits 
wide, and is arranged in the format shown in Figure 17. The 
floating-point word is divided into three fields: a single-bit sign, 
an eight-bit biased exponent, and a 23-bit fraction. 


The sign bit indicates the sign of the floating-point number's 
value. Non-negative values have a sign of 0, negative values a 
sign of 1. 


The biased exponent is an eight-bit unsigned integer field
representing a multiplicative factor of some power of two. The bias
value is 128. If, for example, the multiplicative factor for a
floating-point number is to be 2^a, the value of the biased
exponent would be a + 128; a is called the true exponent.

The fraction is a 23-bit unsigned fractional field containing the 23
least-significant bits of the floating-point number's 24-bit
mantissa. The weight of this field's most significant bit is 2^-2; the
weight of the least-significant bit is 2^-24.


A floating-point number is evaluated or interpreted per the
following conventions:

let s = sign bit
    e = biased exponent
    f = fraction

if e = 0 and s = 0 ... value = 0
if e = 0 and s = 1 ... value = DEC reserved operand
if 0 < e ≤ 255 ... value = (-1)^s · 2^(e-128) · (.1f)
                           (normalized number)
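
The interpretation rules above translate directly into a small
decoder. The following Python sketch is illustrative only; the
function name decode_dec_f and the use of None to stand for a DEC
reserved operand are assumptions made for this example.

```python
def decode_dec_f(word: int):
    """Interpret a 32-bit word in the Am29325's DEC format.

    Returns a Python float, or None for a DEC reserved operand.
    """
    s = (word >> 31) & 0x1       # sign bit
    e = (word >> 23) & 0xFF      # biased exponent
    f = word & 0x7FFFFF          # 23-bit fraction

    if e == 0:
        # s = 0 is the value zero; s = 1 is a DEC reserved operand.
        return None if s else 0.0

    # .1f: the implied leading 1 has weight 2^-1, so the 24-bit
    # mantissa is (1 followed by f) scaled by 2^-24.
    mantissa = ((1 << 23) | f) / 2**24
    return (-1) ** s * mantissa * 2.0 ** (e - 128)


print(decode_dec_f(0x40800000))
print(decode_dec_f(0x80012345))
```

The first call decodes 0.1 (binary) times 2^1; the second word has
e = 0 and s = 1 and is therefore treated as a reserved operand.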


Zero — The value zero always has a sign of zero. 


DEC Reserved Operand — A DEC reserved operand does not 
represent a numeric value, but is interpreted as a signal or sym- 
bol. DEC reserved operands are used to indicate invalid opera- 
tions and operations whose results have overflowed the destina- 
tion format. They may also be used to pass symbolic information 
from one calculation to another. 


Normalized Number — A normalized number represents a
quantity with magnitude greater than or equal to 2^-128 but less
than 2^127.


Example 1:

The number +3.5 can be represented in floating-point format as
follows:

+3.5 = 11.1₂ × 2^0
     = .111₂ × 2^2

sign = 0

biased exponent = 2₁₀ + 128₁₀ = 130₁₀
                = 10000010₂

fraction = 11000000000000000000000₂
(the leading 1 is implied in the format)

Concatenating these fields produces the floating-point word
41600000₁₆.

Example 2:

The number -11.375 can be represented in floating-point
format as follows:

-11.375 = -1011.011₂ × 2^0
        = -.1011011₂ × 2^4

sign = 1

biased exponent = 4₁₀ + 128₁₀ = 132₁₀
                = 10000100₂

fraction = 01101100000000000000000₂
(the leading 1 is implied in the format)

Concatenating these fields produces the floating-point word
C2360000₁₆.
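
The two encodings above can be reproduced with a short Python
sketch. It is a simplified illustration, not a description of the
device: the name encode_dec_f is hypothetical, excess fraction bits
are simply truncated, and out-of-range values raise an error instead
of producing a reserved operand or zero. Python's math.frexp is
convenient here because it returns a mantissa in the range
0.5 <= m < 1, which matches the .1f form of the DEC format.

```python
import math


def encode_dec_f(value: float) -> int:
    """Pack a finite value into the sign / biased exponent / fraction
    fields of the DEC format (truncating extra fraction bits)."""
    if value == 0.0:
        return 0x00000000

    m, true_exp = math.frexp(abs(value))   # abs(value) == m * 2**true_exp
    biased = true_exp + 128                 # bias value is 128
    if not 1 <= biased <= 255:
        raise OverflowError("outside the normalized DEC range")

    # m * 2**24 is the 24-bit mantissa; masking drops the implied 1.
    fraction = int(m * 2**24) & 0x7FFFFF
    sign = 1 if value < 0 else 0
    return (sign << 31) | (biased << 23) | fraction


print(f"{encode_dec_f(3.5):08X}")
print(f"{encode_dec_f(-11.375):08X}")
```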


DEC Mode Integer Format 


DEC mode integer format is identical to that of the IEEE mode.
Integer numbers are represented as 32-bit, two's complement
words; Figure 7 depicts the integer format. The integer word can
represent a range of integer values from -2^31 to 2^31 - 1.


Operations 


All eight floating-point ALU operations discussed in the Gen- 
eral Description section can be performed in DEC mode. 


Figure 17. DEC-Mode Floating-Point Format

Fields, from MSB to LSB: sign bit (S, bit 31), biased exponent (E),
fraction (F).

VALUE = (-1)^S (2^(E-128)) (.1F)

Various exceptional aspects of the R PLUS S, R MINUS S, R
TIMES S, 2 MINUS S, INT-TO-FP, and FP-TO-INT operations
for this mode are described below. The IEEE-TO-DEC and
DEC-TO-IEEE operations are discussed separately in the
IEEE-TO-DEC and DEC-TO-IEEE Operations section on
page 23.


Operations with DEC Reserved Operands — DEC reserved
operands arise in two ways: they can be generated by the
Am29325 to indicate that an invalid operation or floating-point
overflow has taken place, or they can be provided by the user as
an input operand.


When a DEC reserved operand appears as an input operand, the 
final result of the operation is the same DEC reserved operand. If 
an operation has two DEC reserved operands as inputs, the DEC 
reserved operand on the R port becomes the final result. 


The NAN flag will be HIGH whenever an operation produces a 
DEC reserved operand as a final result. 


Example 1:

Suppose the floating-point addition operation is performed with
the following input operands:

R port: 40800000₁₆ (0.1₂ × 2^1)
S port: 80012345₁₆ (DEC reserved operand)

Result: This operation produces the DEC reserved operand on
        the S port, 80012345₁₆, as the final result. The NAN flag
        will be HIGH.

Example 2:

Suppose the floating-point multiplication operation is performed
with the following input operands:

R port: 80765432₁₆ (DEC reserved operand)
S port: 80000001₁₆ (DEC reserved operand)

Result: Since both input operands are DEC reserved operands,
        the operand on the R port, 80765432₁₆, is the final
        result of the operation. The NAN flag will be HIGH.
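
The propagation rule illustrated by these two examples can be
summarized in a few lines of Python. This is a sketch only; the
function names are hypothetical, and a return value of None simply
means that neither input is a DEC reserved operand, so the normal
arithmetic result would apply.

```python
def is_dec_reserved(word: int) -> bool:
    """A DEC reserved operand has sign = 1 and biased exponent = 0."""
    return (word >> 31) & 0x1 == 1 and (word >> 23) & 0xFF == 0


def propagate_reserved(r_port: int, s_port: int):
    """Return the reserved operand that becomes the final result,
    giving priority to the R port, or None if neither is reserved."""
    if is_dec_reserved(r_port):
        return r_port
    if is_dec_reserved(s_port):
        return s_port
    return None


print(f"{propagate_reserved(0x40800000, 0x80012345):08X}")  # Example 1
print(f"{propagate_reserved(0x80765432, 0x80000001):08X}")  # Example 2
```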


Operations Producing Overflows — If an operation produces a 
rounded result that is too large to fit in the destination format, that 
operation is said to have overflowed. 


A floating-point overflow occurs if an R PLUS S, R MINUS S, R
TIMES S, or 2 MINUS S operation with finite input operand(s)
produces a result which, after rounding, has a magnitude greater
than or equal to 2^127. The final result in such cases will be DEC
reserved operand 80000000₁₆; the overflow, inexact, and NAN
flags will be HIGH.

Integer overflow occurs when the floating-point-to-integer
conversion operation attempts to convert to integer a floating-point
number which, after rounding, is greater than 2^31 - 1 or less than
-2^31. The final result in such cases will be DEC reserved
operand 80000000₁₆; the invalid operation flag will be HIGH.
Note that the overflow and inexact flags remain LOW for integer
overflow.


Operations Producing Underflows — If an operation produces
a floating-point result which, after rounding, has a magnitude
too small to be expressed as a normalized floating-point
number, but greater than zero, that operation is said to have
underflowed. Underflow occurs when an R PLUS S, R MINUS S, or R
TIMES S operation produces a result which, after rounding,
has magnitude:

0 < magnitude < 2^-128.

The final result in such cases will be 0 (00000000₁₆). The
underflow, inexact, and zero flags will be HIGH.

Underflow does not occur if the destination format is integer. If the
infinitely precise result of a floating-point-to-integer conversion
has a magnitude greater than 0 and less than 1, but the rounded
result is 0, the underflow flag remains LOW.


Invalid Operations — If an input operand is invalid for the
operation to be performed, that operation is considered invalid. In DEC
mode, there are only two invalid operations:

- Performing a floating-point-to-integer conversion on a value
  too large to be expressed as a 32-bit integer. In this case the
  final result will be DEC reserved operand 80000000₁₆, and the
  invalid operation and NAN flags will be HIGH.

- Performing a floating-point-to-integer conversion on a DEC
  reserved operand. In this case the final result will be the input
  DEC reserved operand, and the invalid operation and NAN
  flags will be HIGH.


Sign Bit 


For all operations producing a DEC floating-point result, the sign 
bit of the final result is unambiguous, i.e., there is only one sign bit 
value that yields a numerically correct result. 


Rounding 


There are four rounding modes for DEC operation: round to
nearest, round toward +∞, round toward -∞, and round toward 0.
The round toward +∞, round toward -∞, and round toward 0
modes are performed in a manner identical to that for IEEE
operation; refer to the Rounding section under Operation in
IEEE Mode on page 12. The round to nearest mode is similar to
that for IEEE operation, but differs in one respect: for the case in
which the infinitely precise result of an operation is exactly
halfway between two representable values, DEC round to
nearest mode rounds to the value with the larger magnitude,
rather than to the value whose LSB is 0.
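
The difference between the two round to nearest rules shows up
only on exact ties. The Python sketch below illustrates this on
values expressed in units of the destination format's LSB, so that
"representable" means "integer". It is an analogy under that
assumption, not a bit-level model of the device, and both function
names are hypothetical.

```python
import math
from fractions import Fraction


def round_nearest_ieee(x: Fraction) -> int:
    """Ties go to the neighbor whose LSB is 0 (round half to even).
    Python's round() on a Fraction behaves this way."""
    return round(x)


def round_nearest_dec(x: Fraction) -> int:
    """Ties go to the neighbor with the larger magnitude."""
    magnitude = math.floor(abs(x) + Fraction(1, 2))
    return magnitude if x >= 0 else -magnitude


for value in (Fraction(5, 2), Fraction(-5, 2), Fraction(7, 2), Fraction(9, 4)):
    print(value, round_nearest_ieee(value), round_nearest_dec(value))
```

Only the first two values are ties whose nearest-even neighbor has
the smaller magnitude; for every other input the two rules agree.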


Flag Operation 


The Am29325 generates six status flags to monitor floating-point
processor operation. The following is a summary of flag operation
in DEC mode:

Invalid Operation Flag — The invalid operation flag is HIGH if the
FP-TO-INT operation is performed on a floating-point number too
large to be converted to an integer, or on a DEC reserved
operand. If the FP-TO-INT operation is performed on a
floating-point number too large to be converted to integer, the final
result is the DEC reserved operand 80000000₁₆. If the FP-TO-INT
operation is performed on a DEC reserved operand, that operand
becomes the final result.


Overflow Flag — The overflow flag is HIGH if an R PLUS S, R
MINUS S, R TIMES S, or 2 MINUS S operation produces a result
which, after rounding, has a magnitude greater than or equal to
2^127. The final result will be the DEC reserved operand
80000000₁₆.

Underflow Flag — The underflow flag is HIGH if an R PLUS S, R
MINUS S, or R TIMES S operation produces a result which, after
rounding, has a magnitude in the range:

0 < magnitude < 2^-128.

The final result will be 0 (00000000₁₆) in such cases.


Inexact Flag — The inexact flag is HIGH if the final result of an R 
PLUS S, R MINUS S, R TIMES S, 2 MINUS S, INT-TO-FP, or 
FP-TO-INT operation is not equal to the infinitely precise result. 
Note that if the underflow or overflow flag is HIGH, the inexact flag 
will also be HIGH. 


Zero Flag — The zero flag is HIGH if the final result of an
operation is zero. For operations producing an integer or a DEC
floating-point number, the flag accompanies the output 0
(00000000₁₆). (It should be noted that any operation producing a
floating-point 0 in DEC mode will output 00000000₁₆.)

NAN Flag — The NAN flag is HIGH if an R PLUS S, R MINUS S, R
TIMES S, 2 MINUS S, or FP-TO-INT operation produces a DEC
reserved operand as the final result.


IEEE-TO-DEC AND DEC-TO-IEEE OPERATIONS 


The IEEE-TO-DEC and DEC-TO-IEEE operations are used to 
convert floating-point numbers between the IEEE and DEC for- 
mats. Both operations work in a manner independent of the 
IEEE/DEC mode control. 


IEEE-TO-DEC Conversion 


This operation converts an IEEE floating-point number to DEC
floating-point format. Most conversions are exact; in no case
does the round mode have any effect on the final result. There
are, however, a few exceptional cases:

a.) If the IEEE floating-point input has a magnitude greater than
    or equal to 2^127, it is too large to be represented by a DEC
    floating-point number. The final result will be the DEC
    reserved operand 80000000₁₆; the overflow, inexact, and NAN
    flags will be HIGH.

b.) If the IEEE floating-point input is a NAN, the final result will be
    the DEC reserved operand 80000000₁₆; the invalid operation
    and NAN flags will be HIGH.

c.) If the IEEE floating-point input is a denormalized number,
    the final result will be a DEC 0 (00000000₁₆); the zero flag
    will be HIGH.

d.) If the IEEE floating-point input is +0 or -0, the final result will
    be a DEC 0 (00000000₁₆); the zero flag will be HIGH.
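
The conversion and its exceptional cases can be sketched in Python
as follows. Since the IEEE value is 1.f × 2^(e-127) and the DEC
value is .1f × 2^(E-128), an exact conversion keeps the fraction
and adds 2 to the biased exponent. The function name, the set of
flag names, and the treatment of IEEE infinities under case (a) are
assumptions made for this illustration.

```python
def ieee_to_dec(word: int):
    """Convert an IEEE single-precision word to the Am29325's DEC
    format. Returns (result_word, set_of_flags_that_are_HIGH)."""
    sign = word & 0x80000000
    e = (word >> 23) & 0xFF
    f = word & 0x7FFFFF

    if e == 255 and f != 0:
        # Case (b): NAN input.
        return 0x80000000, {"invalid", "nan"}
    if e == 0:
        # Cases (c) and (d): denormalized number, +0 or -0.
        return 0x00000000, {"zero"}
    if e >= 254:
        # Case (a): magnitude >= 2^127. Infinity (e = 255, f = 0)
        # is assumed to fall under this case as well.
        return 0x80000000, {"overflow", "inexact", "nan"}

    # Exact conversion: same fraction, biased exponent raised by 2.
    return sign | ((e + 2) << 23) | f, set()


result, flags = ieee_to_dec(0x40600000)   # IEEE +3.5
print(f"{result:08X}", sorted(flags))
```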


DEC-TO-IEEE Conversion 


This operation converts a DEC floating-point number to IEEE
floating-point format. Most conversions are exact; in no case
does the round mode have any effect on the final result. There
are, however, a few exceptional cases:

a.) If the DEC floating-point input is not 0, but has a magnitude
    less than 2^-126, it is too small to be expressed as a
    normalized IEEE floating-point number. The final result will be an
    IEEE floating-point 0 having the same sign as the input
    (00000000₁₆ for positive inputs and 80000000₁₆ for negative
    inputs); the underflow, inexact, and zero flags will be HIGH.

b.) If the DEC floating-point input is a DEC reserved operand, the
    final result will be quiet NAN 7FA00000₁₆; the invalid
    operation and NAN flags will be HIGH.

c.) If the DEC floating-point input is 0, the final result will be IEEE
    floating-point +0 (00000000₁₆); the zero flag will be HIGH.
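
A matching sketch for the opposite direction is given below. An
exact conversion keeps the fraction and subtracts 2 from the biased
exponent; DEC inputs with a biased exponent of 1 or 2 have
magnitude below 2^-126 and therefore underflow. The function name
and the flag names are hypothetical.

```python
def dec_to_ieee(word: int):
    """Convert a word in the Am29325's DEC format to IEEE single
    precision. Returns (result_word, set_of_flags_that_are_HIGH)."""
    sign = word & 0x80000000
    e = (word >> 23) & 0xFF
    f = word & 0x7FFFFF

    if e == 0:
        if sign:
            # Case (b): DEC reserved operand.
            return 0x7FA00000, {"invalid", "nan"}
        # Case (c): DEC zero.
        return 0x00000000, {"zero"}
    if e <= 2:
        # Case (a): nonzero, but magnitude < 2^-126.
        return sign, {"underflow", "inexact", "zero"}

    # Exact conversion: same fraction, biased exponent lowered by 2.
    return sign | ((e - 2) << 23) | f, set()


result, flags = dec_to_ieee(0xC2360000)   # DEC -11.375
print(f"{result:08X}", sorted(flags))
```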
APPENDIX A:

Differences Between the IEEE Proposed Standard for Binary
Floating-Point Arithmetic and the Am29325's IEEE Mode

When operated in IEEE mode, the Am29325 High-speed
Floating-Point Processor complies with the single-precision
portion of the IEEE Proposed Standard for Binary Floating-Point
Arithmetic (P754, draft 10.0) in most respects. There are,
however, several differences:


Denormalized Numbers 


The Am29325 does not handle denormalized numbers. A de- 
normalized input will be converted to a zero of the same sign 
before the specified operation takes place. The operation pro- 
ceeds in exactly the same manner as if the input were +0 or —0, 
producing the same numerical result and flags. 


If the result of an operation, after rounding, has a magnitude
smaller than 2^-126, the result is replaced by a zero of the
same sign.


Representation of Overflows 


In some rounding modes, the proposed IEEE standard requires 
that overflows be represented as the format’s most positive or 
most negative finite number. In particular: 


- When rounding toward 0, all overflows should produce a result
  of the largest representable finite number with the sign of the
  intermediate result.

- When rounding toward -∞, all positive overflows should
  produce a result of the largest representable positive finite
  number.

- When rounding toward +∞, all negative overflows should
  produce a result of the largest representable negative finite
  number.


The Am29325, however, always represents positive overflows as
+∞ and negative overflows as -∞, regardless of rounding mode.

Projective Mode

The proposed IEEE standard provides only for an affine mode to
control the handling of infinities. The Am29325 provides both
affine and projective modes; the desired mode can be selected by
the user.

Traps

The proposed IEEE standard stipulates that the user be able to
request a trap on any exception. The Am29325 does not support
trapped operation, and behaves as if traps are disabled.


Resetting of Flags 


The proposed IEEE standard states that once an exception flag 
has been set, it is reset only at the user’s request. The Am29325's 
flags, however, reflect the status of the most recent operation. 


Generation of the Underflow Flag 


The proposed IEEE standard suggests several possible criteria 
for determining if underflow occurs. These criteria generate 
underflow flags that differ in subtle ways. The underflow criteria 
chosen for the Am29325 stipulate that underflow occurs if: 


a) the rounded result of an operation has a magnitude in the
   range:

   0 < magnitude < 2^-126,

and

b) the final result is not equal to the infinitely precise result.


Since the Am29325 never produces a denormalized number as
the final result of a calculation, condition (b) is true whenever (a) is
true. Note, then, that the operation of the Am29325's underflow
flag is somewhat different than that of an "IEEE standard" system
using the same underflow criteria. For example, if an operation
should produce an infinitely precise result that is exactly 2^-127,
an "IEEE standard" system would produce that value as the final
result, expressed as a denormalized number. Since that system's
final result is exact, the underflow flag would remain LOW. The
Am29325, on the other hand, would output zero; since its final
result is not exact, the underflow flag would be HIGH.

APPENDIX B:

Differences Between DEC VAX and Am29325 DEC Mode

Operation in DEC mode complies with most aspects of
single-precision floating-point operation outlined in the Digital
Equipment Corporation's VAX Architecture Manual. However, there
are some differences that should be noted:


Format 


The Am29325's DEC format is:
    sign — bit 31
    exponent — bits 30-23
    mantissa — bits 22-0

The VAX format is:
    sign — bit 15
    exponent — bits 14-7
    mantissa — bits 6-0, bits 31-16.


In both cases, fields are listed from MSB to LSB, with bit 31 the 
MSB of the 32-bit word. The Am29325's DEC format can be 
converted to VAX format by swapping the 16 LSBs and 16 MSBs 
of the 32-bit word. 
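
The half-word swap described above is a one-line operation in
software. The Python sketch below is illustrative; the function
name is hypothetical. Because the swap is its own inverse, the same
function converts in either direction.

```python
def swap_halves(word: int) -> int:
    """Exchange the 16 LSBs and 16 MSBs of a 32-bit word."""
    return ((word & 0xFFFF) << 16) | ((word >> 16) & 0xFFFF)


am29325_word = 0x41600000            # +3.5 in the Am29325's DEC format
vax_word = swap_halves(am29325_word)
print(f"{vax_word:08X}")
```

After the swap, the sign lands in bit 15 and the exponent in bits
14-7, as listed for the VAX format.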


Flags vs. Exceptions 


In DEC VAX operation, certain unusual conditions arising during 
system operation may incur an exception, or an indication to the 
operating system that special handling is needed. 


The VAX recognizes a number of arithmetic exceptions. The 
following exceptions are relevant to the operations supported by 
the Am29325: 


Integer overflow trap — indicates that the last operation 
produced an integer overflow. The LSBs of the correct result 
are stored in the destination operand. 


Floating-point overflow trap/fault — indicates that the last
operation produced, after normalization and rounding, a
floating-point number with magnitude greater than or equal
to 2^127. A trap replaces the destination operand with the
DEC reserved operand 80000000₁₆; a fault leaves the
destination operand unchanged.

Floating-point underflow trap/fault — indicates that the last
operation produced, after normalization and rounding, a
floating-point number with magnitude less than 2^-128. A
trap replaces the destination operand with zero; a fault
leaves the destination operand unchanged.


Reserved operand fault — indicates that the last operation 
had a reserved operand as an input. The destination 
operand is unchanged. 


The Am29325 does not directly support DEC traps and faults. 
Rather, it indicates unusual conditions by setting one or more of 
the six status flags HIGH. Table d2 describes flag operation in 
DEC mode. 


Integer Overflow 


In cases of integer overflow, the VAX signals the integer overflow
trap and stores the LSBs of the correct result. The Am29325 sets
the invalid operation flag and outputs the DEC reserved operand
80000000₁₆.


Floating-Point Underflow/Overflow Operation 


The VAX Architecture Manual specifies the action to be taken on 
the destination operand when floating-point underflow or over- 
flow is encountered. The Am29325 has no immediate control 
over this destination operand, as it resides somewhere off-chip, 
either in a register or memory location. This isn’t so much a 
difference between the VAX specification and Am29325 opera- 
tion as it is a difference in scope. 


The Am29325 responds to floating-point underflow by producing
a final result of 0 (00000000₁₆); the underflow, inexact, and zero
flags will be HIGH. It responds to floating-point overflow by
producing the DEC reserved operand 80000000₁₆ as the final result;
the overflow, inexact, and NAN flags will be HIGH.


Handling of DEC Reserved Operands 


If an operation has a DEC reserved operand as an input, the
Am29325 will produce that operand as the final result. If an
operation has two input arguments and both are DEC reserved
operands, the operand on port R becomes the final result. For the
VAX, operations with a DEC reserved operand input or inputs do
not modify the destination operand. As mentioned above, control
of the destination operand is beyond the scope of the Am29325's
operation.


Inexact Flag 


The Am29325 provides an inexact flag to indicate that the final 
result produced by an operation is not equal to the infinitely 
precise result. The VAX does not provide this flag. 





APPENDIX C: 
Performing Floating-Point Division on the Am29325 


While the Am29325 does not have a floating-point division
instruction, it can be used to evaluate reciprocals. The division
C = A/B can then be performed as the multiplication:

C = A · (1/B).

Only a modest amount of external hardware is needed to
implement the reciprocal function.

The technique for calculating reciprocals is based on the
Newton-Raphson method for obtaining the roots of an equation.
The roots of the equation

F(x) = 0

can be found by iteratively evaluating the equation

x_{i+1} = x_i - F(x_i)/F'(x_i).

The process begins by making a guess as to the value of x_i, and
using this guess or "seed" value to perform the first iteration.
Iterations are continued until the root is evaluated to the desired
accuracy. The number of iterations needed to achieve a given
accuracy depends both on the accuracy of the seed value and the
nature of F(x).

Now consider the equation

F(x) = (1/x) - B.

The root of F(x) is 1/B. The reciprocal of B, then, can be found by
using the Newton-Raphson method to find the root of F(x). The
iterative equation for finding the root is

x_{i+1} = x_i - F(x_i)/F'(x_i)
        = x_i - (1/x_i - B)/(-(x_i)^-2)

or

x_{i+1} = x_i (2 - B·x_i).

It can be shown that, in order for this iterative equation to
converge, the seed value x_0 must fall in the range

0 < x_0 < 2/B    if B > 0
2/B < x_0 < 0    if B < 0.

For example, if the reciprocal of 3 is to be evaluated, the seed
value must be between 0 and 2/3.

The error of x_i reduces quadratically; that is, if the error of x_i is e,
the error is reduced to order e^2 by the next iteration. The number
of bits of accuracy in the result, then, roughly doubles after every
iteration. While this is only an approximation of the actual error
produced, it is a handy rule-of-thumb for determining the number
of iterations needed to produce a result of a certain accuracy,
given the accuracy of the seed.
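
The iteration x_{i+1} = x_i (2 - B·x_i) is easy to try out in
software. The Python sketch below uses ordinary double-precision
arithmetic rather than the Am29325's single-precision data path, and
the function name newton_reciprocal is hypothetical; it is meant
only to show the quadratic shrinking of the error for a seed inside
the convergence range.

```python
def newton_reciprocal(b: float, seed: float, iterations: int):
    """Return the list [x_0, x_1, ..., x_n] of successive
    approximations to 1/b."""
    if b > 0 and not 0 < seed < 2 / b:
        raise ValueError("seed outside 0 < x_0 < 2/B")
    if b < 0 and not 2 / b < seed < 0:
        raise ValueError("seed outside 2/B < x_0 < 0")

    x = seed
    approximations = [x]
    for _ in range(iterations):
        x = x * (2.0 - b * x)       # one Newton-Raphson step
        approximations.append(x)
    return approximations


for i, x in enumerate(newton_reciprocal(7.25, 0.1, 3)):
    print(i, f"{x:.10f}", f"{x - 1 / 7.25:+.10f}")
```

Each printed line shows the iteration number, the approximation, and
its error relative to 1/7.25.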


Example 1:
Find the reciprocal of 7.25.
Solution:
The seed value must fall in the range
    0 < x_0 < 2/7.25
or
    0 < x_0 < .275862.
Suppose x_0 is chosen to be .1.

Iteration 1: x_1 = x_0 (2 - B·x_0)
                 = .1 (2 - (7.25)(.1))
                 = .1275

Iteration 2: x_2 = x_1 (2 - B·x_1)
                 = .1275 (2 - (7.25)(.1275))
                 = .1371421875

Iteration 3: x_3 = x_2 (2 - B·x_2)
                 = .1371421875 (2 - (7.25)(.1371421875))
                 = .1379265230

The actual value of 1/7.25, to ten decimal places, is
.1379310345.

The error after each iteration is:

i    x_i             Error to Ten Places
0    .1              -0.0379310345
1    .1275           -0.0104310345
2    .1371421875     -0.0007888470
3    .1379265230     -0.0000045115

Example 2:
Find the reciprocal of -.3.
Solution:
The seed value must fall in the range
    2/(-.3) < x_0 < 0
or
    -6.66 < x_0 < 0.
Suppose x_0 is chosen to be -2.0.

Iteration 1: x_1 = x_0 (2 - B·x_0)
                 = -2.0 (2 - (-.3)(-2.0))
                 = -2.8

Iteration 2: x_2 = x_1 (2 - B·x_1)
                 = -2.8 (2 - (-.3)(-2.8))
                 = -3.248

Iteration 3: x_3 = x_2 (2 - B·x_2)
                 = -3.248 (2 - (-.3)(-3.248))
                 = -3.3311488

Iteration 4: x_4 = x_3 (2 - B·x_3)
                 = -3.3311488 (2 - (-.3)(-3.3311488))
                 = -3.333331902

The actual value of 1/(-.3), to ten decimal places, is
-3.333333333.

The error after each iteration is:

i    x_i             Error to Ten Places
0    -2.0            1.333333333
1    -2.8            0.533333333
2    -3.248          0.085333333
3    -3.3311488      0.002184533
4    -3.333331902    0.000001431


In order to implement the Newton-Raphson method on the
Am29325, some means is needed to generate the seed used
in the first iteration. One approach is to place a hardware
seed look-up table between the R bus and the Am29325; see
Figure c1. A more detailed diagram of the look-up table appears
in Figure c2.

TABLE c1. CONTENTS OF THE SEED EXPONENT PROM

Address (16)   Data (16)
000            (Note 1)
001            (Note 1)
002            FF
003            FE
004            FD
005            FC
006            FB
007            FA
008            F9
009            F8
00A            F7
00B            F6
00C            F5
00D            F4
00E            F3
00F            F2
010            F1
011            F0
012            EF
...
100            FD
...
1EF            0E
1F0            0D
1F1            0C
1F2            0B
1F3            0A
1F4            09
1F5            08
1F6            07
1F7            06
1F8            05
1F9            04
1FA            03
1FB            02
1FC            01
1FD            (Note 2)
1FE            (Note 2)
1FF            (Note 2)

1. The reciprocals of these numbers are too large to be represented in DEC
   format.

2. The reciprocals of these numbers are too small to be represented in
   normalized IEEE format.

Figure c1. Adding a Hardware Look-Up Table to the Am29325

The look-up table has two sections: a biased exponent look-up
PROM and a fraction look-up PROM. The seed biased exponent
look-up table is stored in a 512-by-8-bit PROM. This table
consists of two sections: the DEC format section, which occupies
addresses 000-0FF₁₆, and the IEEE section, which occupies
addresses 100-1FF₁₆. The appropriate table will be selected
automatically if address line A8 is wired to the Am29325's
IEEE/DEC pin. The equations implemented by these table sections are:

DEC table:  seed biased exponent
            = 257₁₀ - input biased exponent
IEEE table: seed biased exponent
            = 253₁₀ - input biased exponent

Table c1 lists the contents of this PROM.
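
The contents of the seed exponent PROM follow from the two
equations, so the table can be generated rather than typed in. The
Python sketch below is illustrative only. It assumes the constant
253 for the IEEE section, which is the value consistent with the
worked reciprocal example for 25.3 (input biased exponent 83₁₆, seed
biased exponent 7A₁₆); entries that cannot be represented (Notes 1
and 2 of Table c1) are returned as None, and the function name is
hypothetical.

```python
def seed_exponent_prom():
    """Build the 512-entry seed biased exponent table.

    Addresses 0x000-0x0FF hold the DEC section, 0x100-0x1FF the
    IEEE section (address line A8 follows the IEEE/DEC pin).
    """
    prom = [None] * 512
    for address in range(512):
        ieee = address >= 0x100
        input_exponent = address & 0xFF
        seed_exponent = (253 if ieee else 257) - input_exponent
        # Largest usable biased exponent: 255 for DEC, 254 for IEEE
        # (IEEE 255 is reserved for infinities and NANs).
        largest = 254 if ieee else 255
        if 1 <= seed_exponent <= largest:
            prom[address] = seed_exponent
    return prom


prom = seed_exponent_prom()
for address in (0x002, 0x00F, 0x183, 0x1FC, 0x1FD):
    data = prom[address]
    print(f"{address:03X}", "--" if data is None else f"{data:02X}")
```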


The seed fraction look-up table is stored in one or more PROMs,
the number of PROMs depending on the desired accuracy of the
seed value. The hardware depicted in Figure c2 uses two
4K-by-8-bit PROMs to implement a fraction look-up table whose
inputs are the 12 MSBs of the input argument's fraction. These
PROMs output the 16 MSBs of the seed's fraction field; the
remaining 7 bits of fraction are set to 0. The equation
implemented in this table is:

seed fraction = 2/(1 + input fraction) - 1

where the value of the input fraction falls in the range

0 ≤ input fraction < 1.

Note that the seed fraction must also be constrained to fall in
the range

0 ≤ seed fraction < 1.

Therefore, if the input fraction is 0, the corresponding seed
fraction stored in the table must be .1111...111₂, not 1.0₂. The same
seed fraction look-up table may be used for both IEEE and DEC
formats. Table c2 contains a partial listing for the seed fraction
look-up table shown in Figure c2.
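
A table of this kind can be generated from the equation above. The
Python sketch below is an illustration under stated assumptions: the
16-bit output is rounded to nearest (the text does not state how the
PROM values were quantized), the all-ones code is used when the
seed fraction would otherwise reach 1.0, and the function name is
hypothetical.

```python
def seed_fraction_prom(address_bits: int = 12, output_bits: int = 16):
    """Build the seed fraction table: one output code per address,
    where the address is the 12 MSBs of the input fraction."""
    entries = 1 << address_bits
    full_scale = 1 << output_bits
    table = []
    for address in range(entries):
        input_fraction = address / entries
        seed_fraction = 2.0 / (1.0 + input_fraction) - 1.0
        code = int(seed_fraction * full_scale + 0.5)   # assumed rounding
        # Keep the seed fraction below 1.0: clamp to .1111...111.
        table.append(min(code, full_scale - 1))
    return table


table = seed_fraction_prom()
for address in (0x000, 0x005, 0x009, 0xFF6):
    print(f"{address:03X}", f"{table[address]:04X}")
```

The two bytes of each code correspond to the two 4K-by-8-bit PROMs;
the remaining 7 fraction bits of the seed are zero.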


TABLE c2. CONTENTS OF THE SEED FRACTION PROMs

                 Value of Input   Value of Seed              PROM Outputs (16)
Address (16)     Fraction (10)    Fraction (10)              R22-R15   R14-R7
000              0.0              0.9999999999 (see text)    FF
001              0.0002441406     0.9995118370
002              0.0004882812     0.9990239150
003              0.0007324219     0.9985362280
004              0.0009765625     0.9980487790
005              0.0012207031     0.9975615710                         60
006              0.0014648438     0.9970745970                         40
007              0.0017089844     0.9965878630                         20
008              0.0019531250     0.9961013650                         00
009              0.0021972656     0.9956151030                         E1
00A              0.0024414063     0.9951290800                         C0
00B              0.0026855469     0.9946432920                         A1
00C              0.0029296875     0.9941577400                         81
...
FF6              0.9975585938     0.0012221950                         50
FF7              0.9978027344     0.0010998410                         48
FF8              0.9980468750     0.0009775170                         40
FF9              0.9982910156     0.0008552230                         38
FFA              0.9985351563     0.0007329590                         30
FFB              0.9987792969     0.0006107240                         28
FFC              0.9990234375     0.0004885200
FFD              0.9992675781     0.0003663450
FFE              0.9995117188     0.0002442000
FFF              0.9997558594     0.0001220850

Figure c2. The Hardware Look-Up Table

(Block diagram: the R bus is split into the sign (R31, 1 bit), the
biased exponent (R30-R23, 8 bits) and the 12 MSBs of the fraction
(R22-R11). IEEE/DEC drives A8 and the biased exponent drives A7-A0
of an Am27S15 512 x 8 seed exponent PROM with outputs D7-D0; the
fraction bits address two Am27S43 4K x 8 seed fraction PROMs. The
outputs are the seed sign, seed exponent and seed fraction.)

With the hardware look-up table in place, the reciprocal of value B
can be calculated with the following series of operations:

1.) Place B on both the R and S buses. The 2:1 multiplexer at
    the output of the hardware look-up table should select the
    output of the look-up table. (see Figure c3-a)

2.) Load the seed value x_0 into register R and load B into register
    S. Select the R TIMES S operation. (see Figure c3-b)

3.) Load product B·x_0 into register F. Select the 2 MINUS S
    operation, and select register F as the input to the ALU S port.
    (see Figure c3-c)

4.) Load 2 - B·x_0 into register F. Select the R TIMES S operation
    and select register F as the input to the ALU S port. (see
    Figure c3-d)

5.) Load the value x_1 (x_1 = x_0 (2 - B·x_0)) into registers R and F.
    Select the R TIMES S operation. (see Figure c3-e)

6.) Repeat steps 3 through 5 until the result has the accuracy
    desired.
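
The register-level procedure above can be imitated in software by
rounding every intermediate result to single precision. The Python
sketch below is a simplified model, not a description of the device:
it assumes IEEE format with round to nearest, ignores flags and
exceptional operands, and uses hypothetical helper names. The seed
word is supplied by the caller in place of the hardware look-up
table.

```python
import struct


def to_f32(x: float) -> float:
    """Round a Python float to IEEE single precision."""
    return struct.unpack(">f", struct.pack(">f", x))[0]


def word_to_float(word: int) -> float:
    return struct.unpack(">f", struct.pack(">I", word))[0]


def float_to_word(x: float) -> int:
    return struct.unpack(">I", struct.pack(">f", x))[0]


def reciprocal(b_word: int, seed_word: int, iterations: int) -> int:
    """Follow steps 2 through 6 and return the final register R."""
    reg_s = word_to_float(b_word)        # step 2: B into register S
    reg_r = word_to_float(seed_word)     #         x_0 into register R
    for _ in range(iterations):
        reg_f = to_f32(reg_r * reg_s)    # step 3: F = B * x_i
        reg_f = to_f32(2.0 - reg_f)      # step 4: F = 2 - B * x_i
        reg_f = to_f32(reg_r * reg_f)    # step 5: F = x_i (2 - B * x_i)
        reg_r = reg_f                    #         also loaded into R
    return float_to_word(reg_r)


# B = 25.3 (41CA6666 hex) with seed 3D21E800 hex.
print(f"{reciprocal(0x41CA6666, 0x3D21E800, 1):08X}")
```

Because the product of two single-precision values is exact in
double precision, rounding once per step reproduces a correctly
rounded single-precision operation.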


Figure c3-a. Data Flow for Step 1 of the Reciprocal Procedure

Figure c3-b. Data Flow for Step 2 of the Reciprocal Procedure
(Register S: [B]; Register R: [x_0])

Figure c3-c. Data Flow for Step 3 of the Reciprocal Procedure
(Register S: [B]; Register R: [x_0]; Register F: [B·x_0])

Figure c3-d. Data Flow for Step 4 of the Reciprocal Procedure
(Register F: [2 - B·x_0])

Figure c3-e. Data Flow for Step 5 of the Reciprocal Procedure
(Register R: [x_1]; Register F: [x_1], where x_1 = x_0 (2 - B·x_0))

A tabular description of the operations above is given in Table c3.
The following examples, performed in IEEE format, illustrate the
process.


Example 1:
Find the reciprocal of 25.3.


Solution: The IEEE floating-point representation for 25.3 is
41CA6666₁₆. The reciprocal process is begun by feeding this value
to both the seed look-up table and port S. The look-up table
produces the value .03952789₁₀ (3D21E800₁₆). The reciprocal is
evaluated using the procedure described above; register values for
each step are given in Table c4. The expected result, to the
precision of the floating-point word, is .03952569₁₀ (3D21E5B1₁₆).
In this case the expected result is produced after the first
iteration. All subsequent iterations produce the same result, and
are therefore unnecessary.


TABLE c3. SEQUENCE OF EVENTS FOR EVALUATING RECIPROCALS


Clock
Cycle  Operation   Register R            Register S  Register F
  2    R TIMES S   X0                    B           X
  3    2 MINUS S   X0                    B           B·X0
  4    R TIMES S   X0                    B           2 − B·X0
  5    R TIMES S   X1 (= X0(2 − B·X0))   B           X1 (= X0(2 − B·X0))
  6    2 MINUS S   X1                    B           B·X1
  7    R TIMES S   X1                    B           2 − B·X1
  8    R TIMES S   X2 (= X1(2 − B·X1))   B           X2 (= X1(2 − B·X1))

X = DON'T CARE


TABLE c4. INPUT BUS AND REGISTER VALUES FOR EXAMPLE 1

Clock
Cycle  R Input      S Input      Register R   Register S   Register F
  1    3D21E800₁₆   41CA6666₁₆   -            -            -
       (.03952789)  (25.3)
  2    -            -            3D21E800₁₆   41CA6666₁₆   -
                                 (.03952789)  (25.3)
  3    -            -            3D21E800₁₆   41CA6666₁₆   3F8001D3₁₆
                                 (.03952789)  (25.3)       (1.0000556)
  4    -            -            3D21E800₁₆   41CA6666₁₆   3F7FFC5A₁₆
                                 (.03952789)  (25.3)       (.99984419)
  5    -            -            3D21E5B1₁₆   41CA6666₁₆   3D21E5B1₁₆   <- Result of first
                                 (.03952569)  (25.3)       (.03952569)     iteration
  6    -            -            3D21E5B1₁₆   41CA6666₁₆   3F7FFFFF₁₆
                                 (.03952569)  (25.3)       (.99999994)
  7    -            -            3D21E5B1₁₆   41CA6666₁₆   3F800000₁₆
                                 (.03952569)  (25.3)       (1.0)
  8    -            -            3D21E5B1₁₆   41CA6666₁₆   3D21E5B1₁₆   <- Result of second
                                 (.03952569)  (25.3)       (.03952569)     iteration

Cycles 3 through 5: first iteration. Cycles 6 through 8: second iteration.
Example 2:
Find the reciprocal of −.4725.


Solution: The IEEE floating-point representation for −.4725 is
BEF1EB85₁₆. The reciprocal process is begun by feeding this value
to both the seed look-up table and port S. The look-up table
produces the value −2.11621094₁₀ (C0077000₁₆). The reciprocal is
evaluated using the procedure described above; register values for
each step are given in Table c5. The expected result, to the
precision of the floating-point word, is −2.116402₁₀ (C0077322₁₆).
In this case the expected result is produced after the first
iteration. All subsequent iterations produce the same result, and
are therefore unnecessary.





TABLE c5. INPUT BUS AND REGISTER VALUES FOR EXAMPLE 2

Clock
Cycle  R Input       S Input      Register R    Register S   Register F
  1    C0077000₁₆    BEF1EB85₁₆   -             -            -
       (−2.1162109)  (−0.4725)
  2    -             -            C0077000₁₆    BEF1EB85₁₆   -
                                  (−2.1162109)  (−0.4725)
  3    -             -            C0077000₁₆    BEF1EB85₁₆   3F7FFA14₁₆
                                  (−2.1162109)  (−0.4725)    (0.99990963)
  4    -             -            C0077000₁₆    BEF1EB85₁₆   3F8002F6₁₆
                                  (−2.1162109)  (−0.4725)    (1.0000904)
  5    -             -            C0077322₁₆    BEF1EB85₁₆   C0077322₁₆   <- Result of first
                                  (−2.116402)   (−0.4725)    (−2.116402)     iteration
  6    -             -            C0077322₁₆    BEF1EB85₁₆   3F800000₁₆
                                  (−2.116402)   (−0.4725)    (1.0)
  7    -             -            C0077322₁₆    BEF1EB85₁₆   3F800000₁₆
                                  (−2.116402)   (−0.4725)    (1.0)
  8    -             -            C0077322₁₆    BEF1EB85₁₆   C0077322₁₆   <- Result of second
                                  (−2.116402)   (−0.4725)    (−2.116402)     iteration
APPENDIX D:
Summary of Flag Operation 


Tables d1, d2, and d3 summarize flag operation for the IEEE 
mode, the DEC mode, and for the IEEE-TO-DEC and DEC-TO- 
IEEE operations. 


TABLE d1. FLAG SUMMARY FOR IEEE MODE


Operation       Condition(s)                     INV OVF UNF INE ZER NAN

Any operation listed in the
IEEE Invalid Operations Table

R PLUS S        Input operands are finite,
R MINUS S       |rounded result| ≥ 2^128
R TIMES S
2 MINUS S

R PLUS S        0 < |rounded result| < 2^-126
R MINUS S
R TIMES S

R PLUS S        Final result does not equal
R MINUS S       infinitely precise result
R TIMES S
2 MINUS S
INT-TO-FP
FP-TO-INT

R PLUS S        Final result is zero
R MINUS S
R TIMES S
2 MINUS S
INT-TO-FP
FP-TO-INT

R PLUS S        Final result is a NAN
R MINUS S
R TIMES S
2 MINUS S
FP-TO-INT


Notes: INV = Invalid operation flag
       OVF = Overflow flag
       UNF = Underflow flag
       INE = Inexact flag
       ZER = Zero flag
       NAN = NAN flag
       L   = LOW
       H   = HIGH
       *   = State of flag depends on the input operands
             and the operation performed





TABLE d2. FLAG SUMMARY FOR DEC MODE


Operation       Condition(s)                     INV OVF UNF INE ZER NAN

FP-TO-INT       Rounded result > 2^31 − 1
                or rounded result < −2^31

FP-TO-INT       Input is a DEC reserved
                operand

R PLUS S        |Rounded result| ≥ 2^127
R MINUS S
R TIMES S
2 MINUS S

R PLUS S        0 < |rounded result| < 2^-128
R MINUS S
R TIMES S

R PLUS S        Final result does not equal
R MINUS S       infinitely precise result
R TIMES S
2 MINUS S
INT-TO-FP
FP-TO-INT

R PLUS S        Final result is zero
R MINUS S
R TIMES S
2 MINUS S
INT-TO-FP
FP-TO-INT

R PLUS S        Final result is a DEC
R MINUS S       reserved operand
R TIMES S
2 MINUS S
FP-TO-INT


Notes: INV = Invalid operation flag
       OVF = Overflow flag
       UNF = Underflow flag
       INE = Inexact flag
       ZER = Zero flag
       NAN = NAN flag
       L   = LOW
       H   = HIGH
       *   = State of flag depends on the input operands
             and the operation performed





TABLE d3. FLAG SUMMARY FOR IEEE-TO-DEC AND DEC-TO-IEEE CONVERSIONS


Operation       Condition(s)                       INV OVF UNF INE ZER NAN

IEEE-TO-DEC     Input is a NAN

IEEE-TO-DEC     |Input| ≥ 2^127

DEC-TO-IEEE     Input is a DEC reserved operand

DEC-TO-IEEE     0 < |rounded result| < 2^-126

DEC-TO-IEEE     Final result is zero
IEEE-TO-DEC


Notes: INV = Invalid operation flag
       OVF = Overflow flag
       UNF = Underflow flag
       INE = Inexact flag
       ZER = Zero flag
       NAN = NAN flag
       L   = LOW
       H   = HIGH
       *   = State of flag depends on the input operands
             and the operation performed
PACKAGE INFORMATION


PACKAGE PHOTOGRAPHS 





Top View Lateral View 





Bottom View Isometric View 
ABSOLUTE MAXIMUM RATINGS

Storage Temperature ......................... −65 to +150°C
Temperature Under Bias — Tc ................. −55 to +125°C
Supply Voltage to Ground Potential
  Continuous ................................ −0.5 to +7.0V
DC Voltage Applied to Outputs
  for High State ............................ −0.5V to +Vcc Max
DC Input Voltage ............................ −0.5 to +5.5V
DC Output Current, into Outputs ............. 30mA
DC Input Current ............................ −30 to +5.0mA

Stresses above those listed under ABSOLUTE MAXIMUM RATINGS
may cause permanent device failure. Functionality at or above these
limits is not implied. Exposure to absolute maximum ratings for
extended periods may affect device reliability.


OPERATING RANGES

Commercial (C) Devices
  Temperature (TA) .......................... 0 to +70°C
  Supply Voltage ............................ +4.75 to +5.25V
Military (M) Devices
  Temperature (Tc) .......................... −55 to +125°C
  Supply Voltage ............................ +4.5 to +5.5V

Operating ranges define those limits over which the functionality of the
device is guaranteed.


DC CHARACTERISTICS OVER OPERATING RANGE unless otherwise specified


Columns: Parameter, Description, Test Conditions (Note 1), Min,
Typ (Note 2), Max, Units

Description                          Test Conditions (Note 1)

Output HIGH Voltage                  VCC = Min
                                     VIN = VIL or VIH
                                     IOH = −0.4mA

Output LOW Voltage                   VCC = Min
                                     VIN = VIL or VIH
                                     IOL = 4.0mA

                                     Guaranteed Input Logical
                                     Guaranteed Input Logical

Input LOW Current

Input HIGH Current

Input HIGH Current

F0–F31 Off State (High-
Impedance) Output Current

Output Short Circuit Current
(Note 3)

Power Supply Current (Note 4)        COM'L Only: +25°C
                                                 0 to +70°C
                                                 +70°C
                                     MIL Only:   −55 to +125°C
                                                 +125°C


Notes: 1. For conditions shown as Min or Max, use the appropriate value specified under Operating Ranges for the applicable device type.
       2. Typical values are for VCC = 5.0V, +25°C ambient and maximum loading.
       3. Not more than one output should be shorted at a time. Duration of the short circuit test should not exceed one second.
       4. Measured with OE LOW, and with all output bits (F0–F31 and flag outputs) LOW.
SWITCHING CHARACTERISTICS
OVER OPERATING RANGE


Limit columns: TA = 25°C, VCC = 5.0V | TA = 0 to +70°C, VCC = +5V ±5% |
TC = −55 to +125°C, VCC = +5V ±10%

Parameter  Description                                   Test Conditions

           Clocked Add, Subtract Time (R PLUS S,
           R MINUS S, 2 MINUS S)
           Clocked Multiply Time (R TIMES S)
           Clocked Conversion Time (INT-TO-FP,
           FP-TO-INT, IEEE-TO-DEC, DEC-TO-IEEE)
tASUC      Unclocked Add, Subtract Time (R, S to F,
           Flags) for R PLUS S, R MINUS S,
           and 2 MINUS S Instructions
tMUC       Unclocked Multiply Time (R, S to F, Flags)
           for R TIMES S Instruction
tCUC       Unclocked Conversion Time (R, S to F,
           Flags) for INT-TO-FP, FP-TO-INT, IEEE-
           TO-DEC and DEC-TO-IEEE Instructions
tPWH       Clock Pulse Width HIGH
tPWL       Clock Pulse Width LOW
tPDOF1     Clock to F0–F31 and Flag Outputs              FT1 = HIGH
tPDOF2     Clock to F0–F31 and Flag Outputs              FT1 = LOW
tPZL       OE Enable Time, Z to LOW
tPZH       OE Enable Time, Z to HIGH
           OE Disable Time, LOW to Z
           OE Disable Time, HIGH to Z
           Clock to F0–F15 Enable,                       S16/32 = HIGH,
           16-Bit I/O Mode (Z to LOW, Z to HIGH)         ONEBUS = LOW
           Clock to F0–F15 Disable,
           16-Bit I/O Mode (LOW to Z, HIGH to Z)
           Clock to F16–F31 Enable,
           16-Bit I/O Mode (Z to LOW, Z to HIGH)
           Clock to F16–F31 Disable,
           16-Bit I/O Mode (LOW to Z, HIGH to Z)
tSCE       Register Clock Enable Setup Time
tHCE       Register Clock Enable Hold Time
           R0–R31, S0–S31 Setup Time (Note 1)            FT0 = LOW
           R0–R31, S0–S31 Setup Time (Note 1)            FT0 = HIGH
           R0–R31, S0–S31 Hold Time (Note 1)             FT0 = LOW
           R0–R31, S0–S31 Hold Time (Note 1)             S16/32 = HIGH,
                                                         ONEBUS = LOW
           I0–I2 Instruction Select Setup Time           FT for Destination
           I0–I2 Instruction Select Hold Time            Register = LOW
           I0–I2 Instruction Select to F0–F31, Flags     FT1 = HIGH
tSI3       I3 Port S Input Select Setup Time             FT1 = LOW
           I3 Port S Input Select Hold Time
tSI4       I4 Register R Input Select Setup Time         FT0 = LOW
           (Note 1)
           I4 Register R Input Select Hold Time
           (Note 1)
tSRM       Round Mode Select Setup Time                  FT for Destination
           Round Mode Select Hold Time                   Register = LOW
tPRF       Round Mode Select to F0–F31, Flags            FT1 = HIGH


Notes: 1. See timing diagram for desired mode of operation to determine clock edge to which these setup and hold times apply.
       2. At air velocity of _ linear feet per minute.
SWITCHING WAVEFORMS

CLOCKED OPERATION: FT0 = LOW, FT1 = LOW
(Timing diagram)


CLOCKED OPERATION: FT0 = HIGH, FT1 = LOW
(Timing diagram; shows tPDOF2)


CLOCKED OPERATION: FT0 = LOW, FT1 = HIGH
(Timing diagram; shows tSCE, tHCE and tPDOF1)


FLOW-THROUGH OPERATION (FT0 = HIGH, FT1 = HIGH)
(Timing diagram; shows tASUC and tMUC)


16-BIT, TWO-INPUT-BUS MODE
(Timing diagram; signals include R INPUT BUS, S INPUT BUS and I4)

Note 1. I4 has special setup and hold time requirements in this mode. All other control signals have timing requirements as shown in the diagram
"Clocked operation, FT0 = LOW, FT1 = LOW."


OUTPUT ENABLE/DISABLE TIMING
(Timing diagram; HIGH LEVEL and LOW LEVEL cases)
Am29325 PINOUT


SORTED BY PIN NUMBER          SORTED BY FUNCTIONAL NAME

(Pinout tables. Legible entries: D3 ENR; E15 PROJ/AFF. Signal names
appearing in the tables include IEEE/DEC, ONEBUS, S16/32, FT1, I4,
Inexact, Invalid, Overflow, Underflow, Zero, GND TTL, VCC ECL and
VCC TTL.)



POWER SUPPLY WIRING CONSIDERATIONS


(Diagram: Am29325 package connected to the system VCC plane through a
100pF ceramic decoupling capacitor.)


Notes: 1. All power supply pins must be connected.

       2. ECL GND and TTL GND should not be connected directly into the main system ground plane. Using signal plane traces as short and wide as
          possible, ECL GND pins should be connected together, as should TTL GND pins, but without interconnection. These separate ground buses
          should be connected together and to the system ground plane at a decoupling capacitor close to the package. ECL VCC and TTL VCC should be
          treated similarly. See diagram above.
SUGGESTED PRINTED CIRCUIT BOARD LAYOUT


(Diagram: bottom view of the pin grid, showing the TTL GND connection.)


Note: 1. D4 (alignment pin) is not connected internally; it may be wired to TTL ground or left unconnected.


THERMAL CHARACTERISTICS


(Graph: thermal resistance, °C/W, versus air velocity, linear feet per minute.)


PHYSICAL DIMENSIONS*


(Drawing: bottom view of the package with pin columns A through R and
heatsink; overall dimension 1.540 to 1.560.)

*Subject to change.


The International Standard of Quality
guarantees the AQL on all electrical parameters,
AC and DC, over the entire operating range.














Am29331


16-Bit Microprogram Sequencer
ADVANCED INFORMATION


DISTINCTIVE CHARACTERISTICS


• 16-Bit Address, Up to 64K Words

Supports 80–90ns microcycle time for a 32-bit high
performance system when used with the other members
of the Am29300 Family.


• Real Time Interrupt Support

Micro-TRAP and Interrupts are handled transparently at
any microinstruction boundary.


• Built-In Conditional Test Logic

Generates inequality evaluation branch conditions from
four ALU status bits. Has eight external tests plus a
polarity input.


• Break-Point Logic

Built-in address comparator allows break-points in the
microcode for debugging and statistics collection.


• Master/Slave Error Checking

Two sequencers can operate in parallel as a Master and a
Slave. The Slave generates a fault flag for unequal results.


• 33-Level Stack

Provides support for interrupts, loops and subroutine
nesting. It can be accessed through the D-bus to support
diagnostics.


GENERAL DESCRIPTION


The Am29331 is a 16-bit wide high-speed single chip sequencer
designed to control the execution sequence of microinstructions
stored in the microprogram memory. The instruction set is designed
to resemble high-level language constructs, thereby bringing
high-level language programming to the micro level.


The Am29331 is interruptible at any microinstruction boundary
to support real-time interrupts. Interrupts are handled
transparently to the microprogrammer as an unexpected
procedure call. Traps are also handled transparently at any
microinstruction boundary. This feature allows re-execution
of a prior microinstruction. Two separate buses are provided
to bring a branch address directly into the chip from two
sources to avoid slow turn-on and turn-off times for different
sources connected to the data input bus. Four sets of multiway
inputs are also provided to avoid slow turn-on and
turn-off times for different branch address sources. This
feature allows implementation of table look-up or use of
external conditions as part of a branch address. The thirty-
three deep stack provides the ability to support interrupts,
loops and subroutine nesting. The stack can be read through
the D-bus to support diagnostics or to implement multitasking
at the micro-architecture level. The master/slave
mode provides a complete function check capability for
the device.

The Am29331 is designed with the IMOX™ process which
allows internal ECL circuits with TTL-compatible I/O. It is
housed in a 120-lead pin-grid-array package.


SIMPLIFIED BLOCK DIAGRAM


(Block diagram. Labels: D-BUS, A-BUS, MULTIWAY INPUTS, PROGRAM
COUNTER, BREAK PT. LOGIC, CARRY-IN, Y-BUS.)


IMOX is a trademark of Advanced Micro Devices, Inc.

This document contains information on a product under development at Advanced Micro Devices, Inc. The information is intended to help you to
evaluate this product. AMD reserves the right to change or discontinue work on this proposed product without notice.

Order # 05729B


RELATED PRODUCTS


Part No.    Description

Am29323     32 x 32 Parallel Multiplier
Am29325     32-Bit Floating Point Processor
Am29332     32-Bit Extended Function ALU
Am29334     64 x 18 Four Port, Dual Access
            Register File


Figure 1. Am29331 Block Diagram

(Block diagram; includes the interrupt return address register.)





PIN DESCRIPTION


D0–D15      Data, Bidirectional, Three-State
            Input to address multiplexer, counter, stack,
            and comparator register. Output for stack and
            stack pointer.

A0–A15      Alternate Data, Input
            Input to address multiplexer and counter.

M0–3, 0–3   Multiway, Input
            Four sets of multiway inputs providing 16-way
            branches. The first index refers to the set
            number.

Y0–Y15      Address, Bidirectional, Three-State
            Output of microcode address. Input for interrupt
            address.

I0–I5       Instruction, Input
            Selects one of 64 instructions.

T0–T11      Test, Input
            Provides external test inputs.

S0–S3       Select, Input
            Selects one of 16 test conditions.

CP          Clock Pulse, Input
            Clocks sequencer at the low to high transition.

RST         Reset, Input
            Resets the sequencer.

FC          Force Continue, Input
            Overrides instruction with CONTINUE.

INTR        Interrupt Request, Input
            Requests the sequencer to interrupt
            execution.

INTEN       Interrupt Enable, Input
            Enables interrupts.

INTA        Interrupt Acknowledge, Bidirectional,
            Three-State, Active Low
            Indicates that an interrupt is accepted.

HOLD        Input
            Stops the sequencer and three-states the
            outputs.

OED         Output Enable D-Bus, Input
            Enables the D-bus driver provided the sequencer
            is not in the hold or slave mode.

SLAVE       Input
            Makes the sequencer a slave.

ERROR       Output, Three-State
            Indicates a Master/Slave error in the slave
            mode. Indicates a malfunctioning driver or
            contention in the master mode.

CIN         Input, Active Low
            Carry-in to incrementer.

A-FULL      Almost Full, Bidirectional, Three-State
            Indicates that SP = 28.

EQUAL       Bidirectional, Three-State
            Indicates that the address comparator is enabled
            and has found a match.


ARCHITECTURE 


The major blocks of the sequencer are the address multiplexer,
the microprogram counter (PC), the stack (with the top of stack
denoted TOS), the counter (C), the test multiplexer with logic, and
the address comparison register (R) (Figure 1). The bidirectional
D-bus provides branch addresses and iteration counts; it also
allows access to the stack from outside. The A-bus may be used
for map addresses. There are four sets of four-bit multiway
branch inputs (M). The bidirectional Y-bus either outputs
microprogram addresses or inputs interrupt addresses. The buses are
all 16 bits wide. Figure 1 shows a block diagram of the sequencer.


ADDRESS MULTIPLEXER 


The address multiplexer can select an address from any of five 
sources: 


1) A branch address supplied by the D-bus. 

2) A branch address supplied by the A-bus. 

3) A multiway branch address. 

4) A return or loop address from the top of stack. 

5) The next sequential address from the incrementer. 


MULTIWAY BRANCH ADDRESS 


A multiway branch address is formed by substituting the lower
four bits of the address on the D-bus (D3D2D1D0) with one
of the four sets (M0, M1, M2 or M3) of four-bit multiway branch
addresses. The multiway branch set is selected by the number
D1D0, while the bits D3 and D2 are don't cares.
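
The address formation described above can be sketched as follows. This is an illustration only: the function name `multiway_branch_address` and the representation of the four multiway sets as a Python list are assumptions, not part of the device description.

```python
def multiway_branch_address(d: int, m_sets: list) -> int:
    """Form a multiway branch address.

    d      -- 16-bit value on the D-bus
    m_sets -- four 4-bit values, the multiway inputs M0..M3
    """
    set_index = d & 0b11              # D1 D0 select one of the four sets
    m = m_sets[set_index] & 0xF       # the selected four-bit multiway input
    return (d & 0xFFF0) | m           # D15..D4 kept, D3..D0 replaced


# D1 D0 = 10 selects M2; D3 and D2 are don't cares.
addr = multiway_branch_address(0x1232, [0x0, 0x5, 0xA, 0xF])
print(hex(addr))
```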


ADDRESS REGISTER 


The address register contains the current address. It is loaded
from the interrupt multiplexer and feeds the incrementer. The
incrementer is inhibited if CIN is taken HIGH.


STACK 


A 33-word deep and 16-bit wide stack provides first-in last-out
storage for return addresses, loop addresses, and counter values.
Items to be pushed come from the incrementer, the interrupt
return address register, the counter, or the D-bus. Items popped
go to the address multiplexer, the counter, or the D-bus.


The access to the stack via the D-bus may be used for context
switching, stack extension, or diagnostics. As the stack is only
accessible from the top, stack extension is done by temporarily
storing the whole or some lower part of the stack outside the
sequencer. The save and the later restore are done with pop and
push operations respectively at balanced points in the
microprogram, i.e., points with the same stack depth. The internal D-bus
driver must be turned on when popping an item to the D-bus; if the
driver is off, the item will be unstacked instead. The driver is
normally turned on when the signal Output Enable is asserted
and the sequencer is not being reset (OED = 1, RST = 1).


The stack pointer is a modulo-64 counter, which is incremented
on each push and decremented on each pop. The stack pointer is
reset to zero when the sequencer is reset, but the pointer may
also be reset by instruction. Thus, the stack pointer indicates the
number of items on the stack as long as stack overflow or
underflow has not occurred. Overflow happens when an item is pushed
onto a full stack, whereby the item at the bottom of the stack is
overwritten. Underflow happens when an item is popped from an
empty stack; in this case the item is undefined.


The contents of the stack pointer are present on the D-bus for all
instructions except POP D, provided the driver is turned on. The
output signal A-FULL is defined as SP = 28.
COUNTER


The counter may be used as a loop counter. It may be loaded from
the D-bus, the A-bus or via a pop from the stack. Its contents may
also be pushed onto the stack.


A normal for-loop is set up by a FOR instruction, which loads the
counter from the D- or A-bus with the desired number of iterations;
the instruction also pushes onto the stack a loop address
that points to the next sequential instruction. The end of the loop is
given by an unconditional END FOR instruction, which tests the
counter value against the value one and then decrements the
counter. If the values differ, the loop is repeated by selecting the
address at the stack as the next address. If the values are equal,
the loop is terminated by popping the stack, thereby removing the
loop address, and selecting the address from the incrementer as
the next address. The number of iterations is a 16-bit unsigned
number, except that the number zero corresponds to 65536
iterations. By pushing and popping counter values it is possible to
handle nested loops.
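
The test-then-decrement behavior of END FOR, including the zero case, can be modeled in a few lines. This is a behavioral sketch only; the function name `for_loop_iterations` is hypothetical and the model ignores the stack and address selection.

```python
def for_loop_iterations(count: int) -> int:
    """Return how many times the loop body runs for a count loaded by FOR."""
    counter = count & 0xFFFF                  # 16-bit counter
    iterations = 0
    while True:
        iterations += 1                       # loop body executes once
        at_one = (counter == 1)               # END FOR tests against one ...
        counter = (counter - 1) & 0xFFFF      # ... and then decrements
        if at_one:
            return iterations                 # loop terminates


print(for_loop_iterations(10))
print(for_loop_iterations(0))
```

Because the test is made before the decrement and the counter wraps modulo 2^16, a loaded value of zero passes through 65535 down to 1 before the loop ends, which gives the 65536 iterations stated above.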


ADDRESS COMPARISON 


The sequencer is able to compare the address from the interrupt
multiplexer with the contents of the comparator register. The
instruction SET loads the comparator register with the address on
the D-bus and enables the comparison, while CLEAR disables it.
The comparison is disabled at reset. A HIGH is present at the
output EQUAL if the comparison is enabled and a match is found.
The comparison is useful for detection of a
breakpoint or counting how often a microinstruction at a specific
address is executed.


INSTRUCTION SET 


The sequencer has 64 instructions that are divided into four
classes of 16 instructions each. The instruction lines I0–I5 use I5
and I4 to select a class and I0–I3 to select an instruction within a
class. The classes are:


I5  I4
0   0   Conditional sequence control,
0   1   Conditional sequence control with inverted
        polarity,
1   0   Unconditional sequence control, and
1   1   Special function with implicit continue.


Note that for the first three classes I5 forces the condition to be
true and I4 inverts the condition. The basic instructions of the first
three classes are shown in Table 1 and the instructions of the
fourth class in Table 2.
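
The class selection and the effective condition can be sketched as below. The expression for the condition follows the note to Table 1, Cond. = (Test[S] or I5) XOR I4; the function names and the packing of I5..I0 into one integer are assumptions made for the illustration.

```python
CLASSES = {
    0b00: "conditional sequence control",
    0b01: "conditional sequence control, inverted polarity",
    0b10: "unconditional sequence control",
    0b11: "special function with implicit continue",
}


def decode(instruction: int):
    """Split a 6-bit instruction (I5..I0) into class name and I3..I0 code."""
    i5 = (instruction >> 5) & 1
    i4 = (instruction >> 4) & 1
    return CLASSES[(i5 << 1) | i4], instruction & 0xF


def condition(test: bool, instruction: int) -> bool:
    """Effective condition for the first three classes."""
    i5 = bool((instruction >> 5) & 1)
    i4 = bool((instruction >> 4) & 1)
    return (test or i5) != i4      # (Test[S] or I5) XOR I4


print(decode(0b100011))
print(condition(False, 0b100011))
```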


Structured microprogramming is supported by sequencer
instructions that singly or in pairs correspond to high-level
language control constructs. Examples are FOR I := D DOWNTO 1
DO ... END FOR and CASE N OF ... END CASE. The instructions
have been given high-level language names where appropriate.
Figure 2 shows how to microprogram important control
constructs; the high-level language is on the left and the
microcode on the right.


TEST CONDITIONS 


The condition for a conditional instruction is supplied by a test
multiplexer, which selects one out of sixteen tests with the select
lines S0–S3. Twelve of these are supplied directly by the inputs
T0–T11, while the remaining four tests are generated by the test
logic from the inputs T8–T11. The following table shows the
assignments.


Test                    Intended Use

T0–T7                   General
T8                      C (Carry)
T9                      N (Negative)
T10                     V (Overflow)
T11                     Z (Zero or equal)
T8 + T11                C + Z (Unsigned less than or equal,
                        borrow mode)
T̄8 + T11                C̄ + Z (Unsigned less than or equal)
T9 ⊕ T10                N ⊕ V (Signed less than)
(T9 ⊕ T10) + T11        (N ⊕ V) + Z (Signed less
                        than or equal)
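
The derived tests are the usual comparison conditions evaluated from ALU status bits after a subtraction A − B. The sketch below illustrates this under stated assumptions: a 16-bit two's-complement subtract, and a C flag that represents a borrow (borrow mode). The function names are hypothetical.

```python
def flags_after_subtract(a: int, b: int, bits: int = 16):
    """Return (C, N, V, Z) for a - b, with C acting as a borrow flag."""
    mask = (1 << bits) - 1
    sign = 1 << (bits - 1)
    a &= mask
    b &= mask
    diff = (a - b) & mask
    c = a < b                                   # borrow out (assumed borrow mode)
    n = bool(diff & sign)                       # result is negative
    v = bool((a ^ b) & (a ^ diff) & sign)       # signed overflow
    z = diff == 0                               # result is zero
    return c, n, v, z


def derived_tests(c: bool, n: bool, v: bool, z: bool) -> dict:
    """The derived conditions, written as in the table above."""
    return {
        "unsigned_le": c or z,            # C + Z, borrow mode
        "signed_lt": n != v,              # N xor V
        "signed_le": (n != v) or z,       # (N xor V) + Z
    }


def to_signed(x: int, bits: int = 16) -> int:
    """Interpret a bit pattern as a two's-complement number."""
    x &= (1 << bits) - 1
    return x - (1 << bits) if x & (1 << (bits - 1)) else x


print(derived_tests(*flags_after_subtract(0x8000, 0x0001)))
```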


FORCE CONTINUE 


The sequencer has a force continue (FC) input, which overrides
the instruction inputs I0–I5 with a CONTINUE instruction. This
makes it possible to share the microinstruction field for the
sequencer instruction with some other control or to initialize a
writable control store.


RESET 


In order to start a microprogram properly the sequencer must be
reset. The reset works like an instruction overriding both the
instruction input and the force continue input. The reset selects
the address 0 at the address multiplexer, forces the EQUAL
output to LOW, and disregards a potential interrupt request. It
synchronously disables the address comparison and initializes
the stack pointer to 0.


TABLE 1


Instruction            Stack

Goto D
Call D
Exit D
End for D, C ≠ 1
End for D, C = 1
Goto A
Call A
Exit A
End for A, C ≠ 1
End for A, C = 1
Goto M                 -
Call M                 Push PC
Exit M                 Pop
End for M, C ≠ 1
End for M, C = 1
End Loop
Call Coroutine         TOS <-> PC
Return                 Pop
End for, C ≠ 1         -
End for, C = 1         Pop


Cond. = (Test[S] or I5) XOR I4
:     = Concatenation
C     = Counter


TABLE 2


Instruction            Stack / Action

Continue
For D                  Push PC
Decrement
Loop                   Push PC
Pop D                  Pop
Push D                 Push D
Reset SP               SP <- 0
For A                  Push PC
Pop C                  Pop
Push C                 Push C
Swap                   TOS <-> C
Push C Load D          Push C
Load D
Load A
Set                    R <- D, Enable
Clear                  Disable


I4 = I5 = HIGH; R = Comp. Register
INTERRUPTS


The sequencer may be interrupted at the completion of the current
microcycle by asserting the interrupt request input INTR. The
return address of the interrupted routine is saved on the stack;
nested interrupts are allowed. An interrupt is accepted if
interrupts are enabled and the sequencer is not being reset or held
(INTEN = HIGH, RESET = LOW, and HOLD = LOW).


When there is no interrupt, addresses go from the address
multiplexer to the Y-bus via the driver and to the incrementer and the
comparator via the interrupt multiplexer. When there is an
interrupt, the driver of the sequencer is turned off, an external driver is
turned on, and the interrupt multiplexer is switched. The interrupt
address is supplied via the external driver to the Y-bus and the
incrementer and the comparator. In order to save the address
from the address multiplexer, the address is stored in the interrupt
address register, which for simplicity is clocked every cycle. The
next microinstruction is the first microinstruction of the interrupt
routine.


In this cycle the address in the interrupt return address register is
automatically pushed onto the stack. Therefore the microinstruction
in this cycle must not use the stack; if a stack operation is
programmed, the result is undefined. The instructions that do not
use the stack are GOTO D, GOTO A, GOTO M, CONTINUE,
DECREMENT, LOAD D, LOAD A, SET and CLEAR. A RETURN
instruction terminates the interrupt routine and the interrupted
routine is resumed. Interrupts only work with a single-level
control path.


TRAPS 


A trap is an unexpected situation linked to the current
microinstruction that must be handled before the microinstruction
completes and changes the state of the system. An example
of such a situation is an attempt to read a word from memory
across a word boundary in a single cycle. When a trap occurs, the
current microinstruction must be aborted and re-executed after
the execution of a trap routine, which in the meantime will take
corrective measures. An interrupt, on the other hand, is not linked
directly to the current microinstruction, which can complete safely
before an interrupt routine is executed.


Execution of a trap requires that the sequencer ignores the current
microinstruction, selects the trap return address at the address
multiplexer, and initiates an interrupt. This will save the trap
return address on the stack and issue the trap address from an
external source. The address register contains the address of the
microinstruction in the pipeline register, thus the address register
already contains the trap return address when a trap occurs. This
address can be selected by the address multiplexer by disabling
the incrementer (CIN = 1), and using the force continue mode
(FC = 1). In this mode the sequencer ignores the current
microinstruction. The remaining part of the trap handling is done by
the interrupt. Thus the section on interrupts also applies to traps.
There is one exception, however. The interrupt enable cannot be
used as a trap enable as it does not control the force continue
mode and the carry-in to the incrementer.


HOLD MODE 
The sequencer has a hold mode in which operation is suspended. 


When the HOLD signal goes active, the incrementer and the 
outputs (except the D-bus) are disabled and the sequencer en- 
ters the hold mode after the current cycle. While the sequencer is 
in this mode, the internal state is left unchanged and the D-bus is 
disabled. When the HOLD signal goes inactive the incrementer 
and the outputs (except the D-bus) are enabled again and the 
sequencer leaves the hold mode after that cycle. 


In a time-multiplexed multi-microprocess system there may be
one sequencer for all processes with microprogrammed context
save and restore, or there may be one sequencer per microprocess,
permitting fast process switch. In the latter case the
Y-buses of the sequencers are tied together and connected to a
single microprogram store. A control unit decides on a
cycle-by-cycle basis which sequencer should be running and activates the
HOLD signal to the remaining sequencers. The hold mode has
higher priority than interrupts, and works independently of the
RESET signal. The hold mode can only be used with a single-level
control path.


MASTER/SLAVE CONFIGURATION 


In some systems reliability is very important. The master/slave
configuration, which consists of two sequencers operated in parallel,
is able to detect faults in both the interconnect and the internal
function of the sequencers. One sequencer is the master and
operates normally. The other is a slave, i.e., all outputs except the 
signal ERROR are turned into inputs and connected to the out- 
puts of the master. Since the slave is operated in parallel with the 
master, it can compare its result with the result of the master and 
signal an error if they differ. The error signal from the master 
indicates a malfunctioning driver or contention. 


Figure 2A 


Loops with unknown number of iterations:


REPEAT                        LOOP

UNTIL CC                      END LOOP NOT CC


WHILE CC DO                   LOOP
                                  IF NOT CC THEN EXIT L

END WHILE                     END LOOP
                              L:


LOOP                          LOOP

    IF CC THEN EXIT               IF CC THEN EXIT L

END LOOP                      END LOOP
                              L:


Figure 2C 


Case Statement,

with D = Ayqs5...Aq4XX00 and Mo, 9-3 = Aglyl90 during 
the GOTO M instruction. AgA;Ap must be 000, and X signifies a 
don't care. 


                              PUSH_D B
CASE I OF                     GOTO M
    0: -                      A:   -
                                   -, RETURN TO B
                              A+2: -
                                   -, RETURN TO B
                              A+4: -
                                   -, RETURN TO B
                              A+6: -
                                   -, RETURN
END CASE                      B:


Figure 2B 


Loop with known number of iterations: 


FOR CNT := 10 DOWNTO 1 DO     FOR_D 10

END FOR                       END FOR


Figure 2D 


Double nested if-statement: 


                              PUSH_D C
IF X THEN                     IF NOT X THEN GOTO A
    IF Y THEN                 IF NOT Y THEN GOTO B
        -                     -, RETURN TO C
    ELSE                      B:
        -                     -, RETURN TO C
    END IF
ELSE                          A:
    IF Z THEN                 IF NOT Z THEN GOTO D
        -                     -, RETURN TO D
    ELSE                      D:
        -                     -, RETURN TO C
    END IF
END IF                        C:
INSTRUCTION SET DEFINITION


Legend: @ = Other instruction
© = Instruction being described
P = Test pass
F = Test fail
O = Register in part


Opcode     Mnemonic   Description
(I5 - I0)

32         BRA_D      Go to D. Unconditional branch to the address
                      specified by the D inputs.

36         BRA_A      Go to A. Unconditional branch to the address
                      specified by the A inputs.

40         BRA_M      Go to M. Unconditional branch to the address
                      specified by the D inputs catenated with the
                      multiway M inputs.

44         BRA_S      Go to TOS. Unconditional branch to the address
                      on the top of the stack. Also End Loop when used
                      to terminate WHILE ... ENDWHILE loops.

0          BRCC_D     If CC is HIGH then branch to the address
                      specified by the D inputs else continue.

4          BRCC_A     If CC is HIGH then branch to the address
                      specified by the A inputs else continue.

8          BRCC_M     If CC is HIGH then branch to the address
                      specified by the D inputs catenated with the
                      multiway M inputs else continue.

12         BRCC_S     If CC is HIGH then branch to the address on the
                      top of the stack else pop the stack and continue.
                      Also End Loop when used to terminate
                      REPEAT ... UNTIL loops.

16         BRNC_D     If CC is LOW then branch to the address
                      specified by the D inputs else continue.

20         BRNC_A     If CC is LOW then branch to the address
                      specified by the A inputs else continue.

24         BRNC_M     If CC is LOW then branch to the address
                      specified by the D inputs catenated with the
                      multiway M inputs else continue.

28         BRNC_S     If CC is LOW then branch to the address on the
                      top of the stack else pop the stack and continue.
                      Also End Loop when used to terminate
                      REPEAT ... UNTIL loops.


Note: Opcode numbers are in decimal notation.


Opcode     Mnemonic   Description
(I5 - I0)

33         CALL_D     Call D. Unconditional branch to the subroutine
                      address specified by the D inputs and push the
                      PC on the stack.

                      Call A. Unconditional branch to the subroutine
                      address specified by the A inputs and push the
                      PC on the stack.

                      Call M. Unconditional branch to the subroutine
                      address specified by the D inputs catenated with
                      the multiway M inputs and push the PC on the
                      stack.

           CALL_S     Call TOS. Exchange PC and TOS. Also call
                      coroutine.

                      If CC is HIGH then call the subroutine address
                      specified by the D inputs else continue.

                      If CC is HIGH then call the subroutine address
                      specified by the A inputs else continue.

                      If CC is HIGH then call the subroutine address
                      specified by the D inputs catenated with the
                      multiway M inputs else continue.

                      If CC is HIGH then call the address on the top
                      of the stack else continue. Also used for
                      conditional coroutine calls.

                      If CC is LOW then call the address specified by
                      the D inputs else continue.

                      If CC is LOW then call the address specified by
                      the A inputs else continue.

                      If CC is LOW then call the address specified by
                      the D inputs catenated with the multiway M
                      inputs else continue.

                      If CC is LOW then call the address on the top of
                      the stack else continue. Also a conditional
                      coroutine call.

           EXIT_D     Exit to D. Unconditional branch to the address
                      specified by the D inputs and pop the stack.

           EXIT_A     Exit to A. Unconditional branch to the address
                      specified by the A inputs and pop the stack.

                      Exit to M. Unconditional branch to the address
                      specified by the D inputs catenated with the
                      multiway M inputs and pop the stack.
Opcode     Mnemonic   Description
(I5 - I0)

2          XTCC_D     If CC is HIGH then exit to the address specified
                      by the D inputs and pop the stack else continue
                      with no pop.

                      If CC is HIGH then exit to the address specified
                      by the A inputs and pop the stack else continue
                      with no pop.

                      If CC is HIGH then exit to the address specified
                      by the D inputs catenated to the multiway M
                      inputs and pop the stack else continue with no
                      pop.

18         XTNC_D     If CC is LOW then exit to the address specified
                      by the D inputs and pop the stack else continue.

22         XTNC_A     If CC is LOW then exit to the address specified
                      by the A inputs and pop the stack else continue.

26         XTNC_M     If CC is LOW then exit to the address specified
                      by the D inputs catenated with the multiway M
                      inputs and pop the stack else continue.

           DJMP_D     If the counter is not equal to one then
                      decrement the counter and branch to the address
                      specified by the D inputs else continue.

                      If the counter is not equal to one then
                      decrement the counter and branch to the address
                      specified by the A inputs else continue.

                      If the counter is not equal to one then
                      decrement the counter and branch to the address
                      specified by the D inputs catenated with the
                      multiway M inputs else continue.

           DJMP_S     If the counter is not equal to one then
                      decrement the counter and branch to the address
                      on the top of the stack else decrement the
                      counter, pop the stack and continue.
Opcode     Mnemonic   Description
(I5 - I0)

3          DJCC_D     If CC is HIGH and the counter is not equal to
                      one then decrement the counter and branch to the
                      address specified by the D inputs else decrement
                      the counter and continue.

                      If CC is HIGH and the counter is not equal to
                      one then decrement the counter and branch to the
                      address specified by the A inputs else decrement
                      the counter and continue.

                      If CC is HIGH and the counter is not equal to
                      one then decrement the counter and branch to the
                      address specified by the D inputs catenated with
                      the multiway M inputs else decrement the counter
                      and continue.

                      If CC is HIGH and the counter is not equal to
                      one then decrement the counter and branch to the
                      address on the top of the stack else decrement
                      the counter, pop the stack and continue.

           DJNCC_D    If CC is LOW and the counter is not equal to one
                      then decrement the counter and branch to the
                      address specified by the D inputs else decrement
                      the counter and continue.

           DJNCC_A    If CC is LOW and the counter is not equal to one
                      then decrement the counter and branch to the
                      address specified by the A inputs else decrement
                      the counter and continue.

           DJNCC_M    If CC is LOW and the counter is not equal to one
                      then decrement the counter and branch to the
                      address specified by the D inputs catenated with
                      the multiway M inputs else decrement the counter
                      and continue.

                      If CC is LOW and the counter is not equal to one
                      then decrement the counter and branch to the
                      address on the top of the stack else decrement
                      the counter, pop the stack and continue.

                      Unconditional return from subroutine.

                      If CC is HIGH then return from subroutine else
                      continue.

                      If CC is LOW then return from subroutine else
                      continue.
Opcode     Mnemonic   Description
(I5 - I0)

49         FOR_D      Initialize loop. Push the PC on the stack, load
                      the counter with the value of the D inputs and
                      continue. Use with DJMP_S for FOR ... NEXT
                      loops.

                      Initialize loop. Push the PC on the stack, load
                      the counter with the value of the A inputs and
                      continue. Use with DJMP_S for FOR ... NEXT
                      loops.

                      Initialize loop. Push the PC and continue. Use
                      with BRCC_S for REPEAT ... UNTIL loops or with
                      XTCC_D and BRA_S for WHILE ... ENDWHILE loops.

           POP_D      Pop the stack, output the value on the D outputs
                      and continue.

           POP_C      Pop the stack, place the value in the counter
                      and continue.

           PUSH_D     Push the D inputs on the stack and continue.

           PUSH_C     Push the counter on the stack and continue.

           SWAP       Exchange the counter and the top of stack and
                      continue.
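The FOR ... NEXT pairing described above (FOR_D to set up the loop, DJMP_S to close it) can be traced with a small software model. This is only an illustrative sketch: the four-entry `program` list, the `run_for_loop` name and the choice that the pushed PC is the address of the instruction following FOR_D are assumptions made for the example, not details taken from the data sheet.

```python
def run_for_loop(n):
    """Trace FOR_D n / body / DJMP_S and count how often the body runs.

    Hypothetical model: addresses 0..3 hold FOR_D, one body
    microinstruction, DJMP_S and the first instruction after the loop.
    """
    program = ["FOR_D", "BODY", "DJMP_S", "DONE"]
    pc, counter, stack, body_runs = 0, 0, [], 0
    while program[pc] != "DONE":
        op = program[pc]
        next_pc = pc + 1
        if op == "FOR_D":
            stack.append(next_pc)    # assumption: pushed PC = next address
            counter = n              # load the counter from the D inputs
        elif op == "BODY":
            body_runs += 1
        elif op == "DJMP_S":
            if counter != 1:
                counter -= 1
                next_pc = stack[-1]  # branch to top of stack, no pop
            else:
                counter -= 1         # last pass: decrement, pop, continue
                stack.pop()
        pc = next_pc
    return body_runs, counter, stack


if __name__ == "__main__":
    print(run_for_loop(10))
```

Under these assumptions a counter value of n gives n passes through the body and leaves the stack as it was before FOR_D. The model is meant for n of 1 or more; it does not reproduce the wrap-around of a finite-width hardware counter.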





Opcode     Mnemonic   Description
(I5 - I0)

59         STACK_C    Push the counter on the stack, load the counter
                      with the value of the D inputs and continue.

           LOAD_D     Load the counter with the value of the D inputs
                      and continue.

           LOAD_A     Load the counter with the value of the A inputs
                      and continue.

48         CONT       Continue.

50         DECR       Decrement the counter and continue.

54         RESET_SP   Reset the stack pointer and continue.

                      Load the comparison register with the value of
                      the D inputs, enable the comparator and
                      continue.

63         CLEAR      Disable the comparator and continue.
Am29332


32-Bit Arithmetic Logic Unit 


ADVANCED INFORMATION 


DISTINCTIVE CHARACTERISTICS 


Single Chip, 32-Bit ALU
    Supports 80-90 ns microcycle time for the 32-bit data path. It is
    a combinatorial ALU with equal cycle time for all instructions.

Flow-through Architecture
    A combinatorial ALU with two input data ports and one output data
    port allows implementation of either parallel or pipelined
    architectures.

64-Bit In, 32-Bit Out Funnel Shifter
    This unique functional block allows n-bit shift-up, shift-down,
    32-bit barrel shift or 32-bit field extract.

Supports All Data Types
    It supports one-, two-, three- and four-byte data for all
    operations and variable-length fields for logical operations.

Multiply and Divide Support
    Built-in hardware to support two-bit-at-a-time modified Booth's
    algorithm and one-bit-at-a-time division algorithm.

Extensive Error Checking
    Parity check and generate provides data transmission check and
    master/slave mode provides complete function checking.


GENERAL DESCRIPTION 


The Am29332 is a 32-bit wide non-cascadable Arithmetic 
Logic Unit (ALU) with integration of functions that normally 
don't cascade, such as barrel shifters, priority encoders 
and mask generators. Two input data ports and one output 
data port provide flow-through architecture and allow the 
designer to implement his/her architecture with any degree 
of pipelining and no built-in penalties for branching. Also, 
the simplicity of a three-bus ALU allows easy implementa- 
tion of parallel or reconfigurable architectures. The register 
file is off-chip to allow unlimited expansion and regular 
addressability. 


The Am29332 supports one-, two-, three- and four-byte
data for arithmetic and logic operations. It also supports
multiprecision arithmetic and shift operations. For logical
operations, it can support variable-length fields up to 32
bits. When fewer than four bytes are selected, unselected
bits are passed to the destination without modification. The
device also supports two-bit-at-a-time modified Booth's
algorithm for high-speed multiplication and one-bit-at-a-time
division. Both signed and unsigned integers for all byte
aligned data types mentioned above are supported.


The Am29332 is designed to support 80-90 ns microcycle 
time. The device is packaged in a 168-lead pin-grid-array 
package. 
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Order #05730B 


RELATED PRODUCTS


Part No.           Description


PIN DESCRIPTION


PA3-PA0            Parity input for operand A on DA-bus (one per
                   byte).

DA31-DA0           Data input lines for operand A.

PB3-PB0            Parity input for operand B on DB-bus (one per
                   byte).

DB31-DB0           Data input lines for operand B.

PY3-PY0            Parity output for data on Y-bus (one per byte).

Y31-Y0             Data Input/Output Lines.
                   When OE-Y is LOW and the ALU is in the Master
                   mode, the ALU result is enabled on the Y-bus. When
                   OE-Y is HIGH, the Y-bus is tristated. In Slave mode
                   the Y-bus acts as external data input.

                   Instruction Inputs.

I8, I7             Byte width inputs for byte boundary aligned operand
                   instructions. Selects the sources for width and
                   position inputs for variable field bit operands. If
                   I7 is LOW it selects the width input from pins
                   W4-W0. If I7 is HIGH the width input is selected
                   from the internal width register. Similarly if I8
                   is LOW it selects the position inputs from pins
                   P5-P0 and if HIGH it selects input from the
                   internal position register.

W4-W0              Width input to select the width of a contiguous bit
                   field.

P5-P0              Position input to select the position of the least
                   significant bit of a field. Also indicates the
                   amount by which data is to be shifted up
                   (P5 = LOW) or down (P5 = HIGH) or rotated.

C, Z, N, V, L      When the Register Status pin is LOW, these pins
                   give the carry, zero, negative, overflow and link
                   outputs of the ALU where applicable to the
                   instruction being executed. When not applicable to
                   the instruction being executed, or when the
                   Register Status pin is HIGH, these pins give the
                   outputs of the carry, zero, negative, overflow and
                   link bits of the internal status register. In SLAVE
                   mode, C, Z, N, V and L become inputs.

Register Status    Register Status Mode Pin.
                   Selects between ALU status (Register Status = LOW)
                   or register status (Register Status = HIGH) on the
                   C, Z, N, V and L outputs.

HOLD               When HIGH it inhibits the update of the status and
                   Q registers.

                   Clocks internal registers (status, Q) at the LOW to
                   HIGH transition, provided HOLD input is LOW.

OE-Y               Output Enable.
                   When OE-Y is HIGH the Y-bus is disabled
                   (tri-stated).

Borrow             When HIGH the Carry In and Carry Out are borrows
                   for subtract operations.

Macro Carry        Macro Status Carry Input.

Macro Link         Macro Status Link Input.

Macro/Micro SEL    When HIGH selects macro carry and macro link pins
                   as input instead of micro carry and micro link from
                   the micro-status register.

SLAVE              When HIGH this pin puts the ALU in the slave mode.
                   All output pins become input pins and signals on
                   them are compared with the ALU's internally
                   generated results. When OE-Y is HIGH, the Y0-Y31
                   and PY0-PY3 inputs are ignored. When the SLAVE pin
                   is LOW, the ALU is put in master mode where outputs
                   are generated as normal.

MS-Error           Master-Slave Error.
                   When HIGH this signal indicates that the master's
                   and slave's data were not identical.

Parity-Error       When HIGH indicates that a parity error was
                   detected on the DA or DB inputs.
Figure 1. Am29332 Family High Performance System Block Diagram


PRODUCT OVERVIEW 


The Am29332 is a 32-bit wide, high performance, non- 
expandable Arithmetic Logic Unit. It has two 32-bit wide input 
ports (A and B) and one 32-bit wide output port (Y). These 
three ports provide flexibility and accessibility for high-perfor- 
mance processor designs. Dedicated input and output ports 
provide a flow-through architecture and avoid the penalty 
associated with switching the bus half-way through the cycle
for input and output of data. The chip is designed for use with
a dual access RAM (Am29334) as a register file. In addition,
the three bus architecture facilitates the connection of other 
arithmetic units in parallel with the Am29332 for high perfor- 
mance systems. 


The Am29332 supports one-, two-, three- and four-byte 
arithmetic operations. It also supports multiprecision arithme- 
tic and multiple bit shifts. For logical operations, it can handle 
variable-length fields of up to 32 bits. The chip incorporates 
dedicated hardware to allow efficient implementation of a two 
bit-at-a-time (modified Booth) multiply algorithm, supporting 
signed and unsigned arithmetic data types. Similarly, hardware 
is provided to support a bit-at-a-time divide algorithm, also 
supporting signed and unsigned arithmetic data types. An 
internal 32-bit register (Q) is used by the multiply and divide 
hardware for double precision operands. For business applica- 
tions, the Am29332 supports variable-length BCD arithmetic. 


Field logical instructions operate on bit-fields taken from the A 
and B data inputs; they may be of variable width and starting 
position. A is normally the source input and B the destination
input. In general, destination bits not falling within a specified 
field are passed by the ALU unchanged. Field width and 
position are specified either by direct inputs to the chip, or by 
entries in the status register. There are two kinds of field 
logical instructions: aligned and non-aligned. The first type of
instruction assumes that source and destination fields are 
aligned and the operation is performed only for bits within the 
specified fields. In the second type of instruction, source and 
destination fields are normally non-aligned. However, it is 
always assumed that one field (either source or destination) is 
least significant bit (LSB) aligned. 


If the destination field is LSB aligned then the source field is 
downshifted in order to make it LSB aligned as well. Downshifting
is accomplished by making the 6-bit position input
equal to the two's complement of the number of places the 
field is to be downshifted. If the source field is LSB aligned 
then it is upshifted in order to align it with the destination. 
Upshifting is accomplished by making the position inputs equal 
to the number of places the field is to be upshifted. Any other 
type of field operation is not allowed. Whenever the field 
crosses the word boundary, the portion not falling within the 
word boundary is ignored. This effect is useful when perform- 
ing operations on fields that overlap two different words. 
Instructions to perform straightforward multiple-bit shifts (ei- 
ther up or down) are also provided. Additionally, it is possible 
to extract a bit-field from a word in one instruction, even if that 
field overlaps a word boundary. 


The power and the flexibility of the processor comes partly 
from its ability to generate a mask to control the width of an 
operation for each instruction without any overhead. For all 
byte aligned instructions (three quarters of the instruction set), 
the mask is either 1, 2, 3 or 4 bytes wide and is generated from 
the byte width input (I8 - I7). For all field instructions the mask
is of variable width and is generated from the position inputs
(P5 - P0) and the width inputs (W4 - W0). Whenever the width
of the operand is less than 32 bits, all unselected bits from the
inputs of the ALU are passed to the output without any 
modification. Depending upon the instruction type, unselected 
bits are taken from different sources. For example in all single 
operand instructions, bits from the source operand (from 
either A or B input) are passed in unselected bit positions. For 
two operand instructions, bits from the B input are passed in 
unselected bit positions. There are some exceptions which are 
explained in the instruction set section. 


The processor has a 32-bit status register to indicate the 
status of different operations performed. The status register is 
loaded at the rising edge of the clock with new status unless 
the HOLD signal is HIGH. The bit position for each status bit is 
given in the functional description. The least significant byte of 
the status register holds the six position bits (P5 - P0). The two
most significant bits of this byte may be read or loaded but are
otherwise unused by the ALU. The second byte (bits 8 to 15)
consists of the five width bits (W4 - W0) and three read-only
bits that are a combinational function of other status bits, and
which indicate useful branch conditions. The third byte consists
of ALU status bits plus bits for high speed multiply and
divide. The most significant byte holds intermediate nibble 
carries for BCD operations. An extract-status instruction is 
provided which allows a Boolean value to be formed from any 
selected bit. This is particularly useful in machines employing a 
stack architecture. Instructions to save and restore the status 
register are provided. As the entire status of each instruction is 
stored in the status register, interrupts at any microinstruction 
boundary are feasible. 


The processor has a 32-bit wide priority encoder to support 
floating-point and graphics operations. The priority encoder 
supports all byte aligned data types — the result is dependent 
upon the byte width specified. The result of a priority encode is 
also loaded into the position bits of the status register. The 
result of the prioritize operation can then be used in the 
following clock cycle, e.g., to normalize a floating-point num- 
ber or to help detect the edge of a polygon in graphics 
applications. 


To support system diagnostics, the Am29332 has a special
"Master-Slave" mode. To use this mode, two chips are
connected in parallel, and hence receive the same instructions
and data. The master chip is used for the normal data path.
However, in the slave chip, all outputs become inputs. The
slave compares the outputs of the master with its own 
internally generated result. If the two do not match, the slave 
will activate an error signal. 


As a further diagnostic aid, byte-wise parity checking is 
performed at both the A and B data inputs. The "parity" signal
is activated if an error is detected. Parity bits (one per byte) are 
generated for the 32-bit output bus. 


FUNCTIONAL DESCRIPTION 


A detailed description of each functional block is given in the
following paragraphs. 


64-Bit Funnel Shifter 


The 64-bit funnel shifter is a combinatorial network. The 64-bit 
input is formed from a combination of the A and B inputs. This 
may be left-shifted by up to 31 bits before being used by the 
ALU. The output of the shifter is the most significant 32 bits of 
the result. The 64-bit shifter can be used on either the A or B 
operands to perform barrel shifts (either up or down) or 
rotates. The operation is controlled by positioning operands 
properly at the input of the 64-bit up-shifter. 


The number "n" by which the operand is shifted comes from two
sources: the microprogram memory via the P5 - P0 pins or the
internal register (byte 0 of the status register), as selected by 
an instruction bit. 


In general, the 6-bit position input, P5 - P0, takes a 6-bit two's
complement number representing upshifts from 0 to 31 places 
(positive numbers) or downshifts from 1 to 32 places (negative 
numbers). 
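The shifter behaviour described above can be mirrored in a short software sketch. It assumes, as an interpretation rather than a statement from the data sheet, that an upshift places the operand in the upper half of the 64-bit input with zeros below it, that a downshift places it in the lower half with zeros above it, and that the low five bits of the position code give the upshift amount of the 64-bit network. The function names are hypothetical.

```python
MASK32 = 0xFFFFFFFF


def funnel(hi, lo, n):
    """Upshift the 64-bit value {hi, lo} by n (0..31); return the top 32 bits."""
    assert 0 <= n <= 31
    return ((((hi & MASK32) << 32) | (lo & MASK32)) << n >> 32) & MASK32


def decode_position(p6):
    """Interpret a 6-bit two's complement position code (-32..31)."""
    p6 &= 0x3F
    return p6 - 64 if p6 & 0x20 else p6


def shift_by_position(x, p6):
    """Logical shift of x: up for positive codes, down for negative codes."""
    n = p6 & 0x1F                  # assumed upshift amount of the network
    if decode_position(p6) >= 0:
        return funnel(x, 0, n)     # operand in the upper word: upshift
    return funnel(0, x, n)         # operand in the lower word: downshift


def rotate_up(x, n):
    """Rotate by feeding the same operand to both halves of the input."""
    return funnel(x, x, n)


if __name__ == "__main__":
    print(hex(shift_by_position(0x00000001, 4)))
    print(hex(rotate_up(0x80000001, 1)))
```

With the operand in the lower word, a code of -k upshifts the 64-bit input by 32 - k places, which is why a single up-shifting network covers downshifts of 1 to 32 places in this model.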


Mask Generator 


The mask generator logic provides the ability to generate the 
appropriate mask for an operand of given width and position. 
The generation of the mask depends upon two types of 
instructions. The first type has byte boundary aligned oper- 
ands (widths of either 1, 2, 3 or 4 bytes) with the least 
significant bit aligned to bit 0. The width of an operand is 
specified by the byte width inputs (I8 and I7) as shown in Table
1. The second type of instruction has operands of variable
width (1 to 32 bits) and position. The operand is specified by
the width inputs (W4 - W0) and the position inputs (P5 - P0)
indicating the least significant bit position of the operand.
Thus, in this type of instruction the operand may or may not be
least significant bit aligned. Depending upon the type of
instruction, the mask generator first generates a fence of all 
zeros starting from the least significant bit with the width 
specified either by the byte width or the width input fields. This 
fence can be upshifted by up to 31 bits by the 32-bit mask 
shifter. Whenever the mask is moved up over the 32-bit 
boundary, it does not wrap around. Instead, ONE's are 
inserted from the least significant end. This configuration 
provides the ability to operate on a contiguous field located 
anywhere in a word, or across a word boundary. 


The mask generator can be used as a pattern generator by 
allowing the mask to pass through ALU (by using the PASS- 
MASK instruction). For example, a single-bit wide mask can be 
generated and by shifting it up by different amounts can give 
walking ONE or walking ZERO patterns for memory tests. 
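As a rough illustration of the fence-and-shift scheme described above, the sketch below builds a 32-bit mask in which ZERO marks a selected bit. The function name and the plain integer arguments (a width of 1 to 32 and a position of 0 to 31, rather than the encoded W and P pin values) are assumptions made for the example.

```python
MASK32 = 0xFFFFFFFF


def field_mask(width, position):
    """Return a 32-bit mask with `width` zeros starting at bit `position`.

    The fence of zeros is first formed at the least significant end and
    then upshifted; ONEs are inserted from the least significant end and
    anything moved past bit 31 is dropped (no wrap-around).
    """
    assert 1 <= width <= 32 and 0 <= position <= 31
    fence = MASK32 & ~((1 << width) - 1)
    return ((fence << position) | ((1 << position) - 1)) & MASK32


def walking_zero():
    """Single-bit mask moved through all 32 positions (memory test pattern)."""
    return [field_mask(1, p) for p in range(32)]


if __name__ == "__main__":
    print(hex(field_mask(8, 4)))
    print(hex(field_mask(8, 28)))
```

Complementing each word of `walking_zero()` gives the corresponding walking ONE pattern. In the second call the field crosses the word boundary, so only its lower four bits remain selected.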


TABLE 1. 


Arithmetic and Logical Unit 


The ALU is a three input unit which uses the mask as a second 
or third operand in every instruction. The mask is used to 
merge two operands. For all selected bits (wherever the mask 
is 0), the desired operation specified by the instruction input is 
performed, and for all unselected bits either corresponding 
destination bits or zeros are passed through. The status of 
each operation (carry, negative, zero, overflow, link) applies to 
the result only over the specified width. For all byte aligned 
arithmetic and logical operations (first three quarters of the 
instruction set), the status is extracted from the appropriate 
byte boundary. For all field operations (last quarter of the 
instruction set), the operand width is assumed to be 32 bits for 
status generation. The ZERO flag always indicates the status 
of all bits selected by the mask. 
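The merge of selected and unselected bits can be sketched for a byte-aligned add. This is a simplified model under stated assumptions: unselected bits are taken from the B operand (the two-operand case mentioned in the product overview), and only carry, zero and negative are derived, each at the selected byte boundary. The name `masked_add` is hypothetical.

```python
MASK32 = 0xFFFFFFFF


def masked_add(a, b, width_bytes):
    """Add the low `width_bytes` bytes of a and b; pass the rest of b through.

    Returns (result, carry, zero, negative) with the flags taken over
    the selected width only.
    """
    assert width_bytes in (1, 2, 3, 4)
    bits = 8 * width_bytes
    selected = (1 << bits) - 1               # bit positions where the mask is 0
    total = (a & selected) + (b & selected)
    result = (total & selected) | (b & ~selected & MASK32)
    carry = (total >> bits) & 1              # carry out of the byte boundary
    zero = (total & selected) == 0           # only selected bits are examined
    negative = (total >> (bits - 1)) & 1     # sign bit of the selected width
    return result, carry, zero, negative


if __name__ == "__main__":
    print(masked_add(0x123456FF, 0xAABBCC01, 1))
```

In the example call only the low byte takes part in the addition, so the upper three bytes of the result are those of B while the carry and zero flags reflect the one-byte sum.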


The actual width of the ALU is 34 bits. There are two extra bits 
used for the high speed signed and unsigned multiplication 
instructions. These two bits are automatically concatenated to 
the most-significant end of the ALU depending upon the width 
specified for the operation. Since the modified Booth algorithm 
requires a two-bit down-shift each cycle, these ALU bits 
generate the two most-significant bits of the partial product. 


The ALU is capable of shifting data down by two bits for the 
multiplication algorithm, up by one bit for the divide algorithm 
and single-bit-up-shifts. 


The processor is capable of performing BCD arithmetic on 
packed BCD numbers. The ALU has separate carry logic for 
BCD operations. This logic generates nibble carries (BCD digit 
carry) from propagate and generate signals formed from the A 
and B operands. In order to simplify the hardware while 
maintaining throughput, the BCD add and subtract operations 
are performed in two cycles. In the first cycle, ordinary binary 
addition or subtraction is performed and BCD nibble carries 
are generated. These are blocked from affecting the result at 
this stage, but are saved in the status register to be used later 
for BCD correction. In the second cycle all BCD numbers are 
adjusted by examining the previously generated nibble carries. 
Since all the necessary information is stored in the status 
register, the processor can be interrupted after the first BCD 
cycle. 
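The two-cycle BCD addition can be modelled as follows. The first function performs the binary add and records one decimal carry per digit; the second applies the correction. Adding 6 to every digit that produced a decimal carry is one arithmetic way to carry out the adjustment and is an assumption of this sketch, not a description of the actual correction logic. Function names are hypothetical and both operands are assumed to be valid packed BCD.

```python
MASK32 = 0xFFFFFFFF


def bcd_add_cycle1(a, b):
    """Binary add plus the eight BCD digit carries (bit i = digit i)."""
    binary_sum = (a + b) & MASK32
    carries = 0
    carry = 0
    for i in range(8):
        digit_a = (a >> (4 * i)) & 0xF
        digit_b = (b >> (4 * i)) & 0xF
        carry = 1 if digit_a + digit_b + carry > 9 else 0
        carries |= carry << i
    return binary_sum, carries


def bcd_add_cycle2(binary_sum, carries):
    """Adjust the binary sum using the saved digit carries."""
    correction = 0
    for i in range(8):
        if (carries >> i) & 1:
            correction |= 6 << (4 * i)
    return (binary_sum + correction) & MASK32


if __name__ == "__main__":
    partial, saved = bcd_add_cycle1(0x00001234, 0x00005678)
    print(hex(bcd_add_cycle2(partial, saved)))
```

Because the second step needs only the binary sum and the saved carries, the model also shows why the operation can be suspended between the two cycles: everything it needs is held in `partial` and `saved`.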
Priority Encoder


The priority encoder is provided to support floating-point 
arithmetic and some graphics primitives. The priority encoder 
takes up to 32 bits as input and generates a 5-bit wide binary 
code to indicate location of the most significant one in the 
operand. Input to the priority encoder comes from the input 
multiplexer, which masks all bits that the user does not want to 
participate in the prioritization. The priority encoder supports 8, 
16, 24 and 32-bit operations depending upon the byte width 
specified. For each data type the priority encoder generates 
the appropriate binary weighted code. For example, when a 
byte width of two is specified, the output of the encoder is zero 
when bit 15 is HIGH. However, if byte width of four is specified 
(I8 - I7 = 00), the output of encoder is 16 (decimal) if bit 15 is
HIGH and bits 31 - 16 are LOW. Table 2 shows the output for 
each data type. If none of the inputs are HIGH or the most 
significant bit of the data type specified is HIGH then the 
output is zero. The difference between these two cases is 
indicated by the Z-flag of the status register which is HIGH 
only if all inputs are zero. 
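The encoding described above amounts to counting the zeros above the most significant one within the selected data type. A minimal sketch, with a hypothetical function name and with the byte width passed as a plain count of 1 to 4 rather than as the encoded byte width inputs:

```python
def priority_encode(x, width_bytes):
    """Return (code, z_flag) for the most significant one in the operand.

    The code is 0 when the top bit of the data type is HIGH and grows by
    one for each lower bit position. An all-zero operand also gives
    code 0, but with the Z flag set.
    """
    assert width_bytes in (1, 2, 3, 4)
    bits = 8 * width_bytes
    x &= (1 << bits) - 1          # bits outside the data type do not take part
    if x == 0:
        return 0, True
    return bits - x.bit_length(), False


if __name__ == "__main__":
    print(priority_encode(0x00008000, 2))
    print(priority_encode(0x00008000, 4))
```

The two calls reproduce the example in the text: bit 15 HIGH gives 0 for a two-byte operand and 16 for a four-byte operand.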


Q-Register 


The Q-register holds dividend and quotient bits for division, 
and multiplier and product bits for multiplication. During 
division, the contents of the Q-register are shifted left, a bit at 
a time, with quotient bits inserted into bit 0. During multiplica- 
tion, the contents of the Q-register are shifted right, two bits at 
a time, with product bits inserted into the most-significant two 
bits (according to the selected byte width). The Q-register may 
be loaded from the A or B inputs and read onto the Y bus. 


Master-Slave Comparator 


All ALU outputs (except MS-Error) employ tri-state buffers. 
The master-slave comparator compares the input and output 
of each buffer. Any difference causes the MS-Error signal to 
be made true. In SLAVE mode, all output buffers are disabled. 
Outputs from a second ALU may then be connected to the 
equivalent pins of the first. The comparator in the slave will 
then detect any difference in the results generated by the two. 
When the Y bus is tri-stated by making Output-Enable false, 
the Y bus master-slave comparators are disabled. 


Parity Logic 


For each byte of the DA and DB inputs there is an associated 
parity bit (8 in all). If a parity error is detected on any byte, the 
PARITY-ERROR signal is made true. Four parity signals (one 
per byte) are also generated for the Y bus outputs. EVEN 
parity is employed for the Am29332. 
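Byte-wise even parity can be sketched as below. Two conventions are assumed for the example and are not stated in the text: the parity bit is chosen so that the nine bits of data plus parity contain an even number of ones, and parity bit i belongs to byte i, with byte 0 the least significant. The function names are hypothetical.

```python
def byte_parity(word):
    """Return four even-parity bits for a 32-bit word (bit i covers byte i)."""
    parity = 0
    for i in range(4):
        byte = (word >> (8 * i)) & 0xFF
        ones = bin(byte).count("1")
        parity |= (ones & 1) << i      # 1 when the byte holds an odd count
    return parity


def parity_error(word, parity_bits):
    """True when the supplied parity bits do not match the data."""
    return byte_parity(word) != (parity_bits & 0xF)


if __name__ == "__main__":
    print(bin(byte_parity(0x01000003)))
    print(parity_error(0x01000003, 0b1000))
```

The same helper serves both directions in this model: generating the four parity bits for an output word and checking the bits received with an input word.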


Status Register 


All necessary information about operations performed in the 
ALU is stored in the 32-bit wide status register after every 
microcycle. Since the register can be saved, an interrupt can 
occur after any cycle. The status register can be loaded from 
either the A or B input of the chip and can be read out on the Y 
bus for saving in an external register file. For loading, the byte 
width indicates how many bytes are to be updated. The status 
register is only updated if the HOLD input is inactive. 


Each byte of the status register holds different types of 
information (see Figure 3). The least significant byte (bits 0 to 
7) holds six position bits for the data shifter. The two most 
significant bits are not used. The next most significant byte 
(bits 8 to 15) holds the 5-bit width field for the mask generator. 
The three most-significant bits of that byte (bits 13 to 15) are 
read-only bits that represent three different conditions 
extracted from the other bits of the status register. They are 
C + Z, N ⊕ V, and (N ⊕ V) + Z for bits 13, 14 and 15 
respectively. These bits can be read on the Y0 pin by the 
extract-status instruction. The next byte contains all the 
necessary information generated by an ALU operation. The 
least-significant four bits (bits 16 to 19) hold carry, negative, 
overflow and zero flags. Bit 20 holds link information for single 
bit shifts and bits 21 and 22 are used by the multiply and divide 
instructions. The M flag holds the multiplier bit for the modified 
Booth algorithm or it holds the sign comparison result for the 
divide algorithm. The S flag holds the sign of the partial 
remainder for unsigned division. Both the flags (M and S) are 
provided as a part of the status register so that multiply and 
divide instructions can be interrupted at microinstruction 
boundaries. The most significant byte of the status register 
holds nibble carries for BCD arithmetic. Since BCD arithmetic 
is performed in two cycles, the nibble carries are saved in the 
first cycle and used in the second cycle. Since all the 
information is stored, BCD instructions are also interruptible at 
the microinstruction boundary. 
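
The field layout described above can be summarized with a small decoding sketch in Python. It is only an illustration: the text gives bits 16 to 19 as "carry, negative, overflow and zero" and bits 21 and 22 as the multiply/divide flags, and the sketch assumes those names map to ascending bit numbers (C=16, N=17, V=18, Z=19, M=21, S=22). The function name and dictionary keys are hypothetical.

```python
def decode_status(sr):
    """Split a 32-bit status register value into its fields.

    Bit assignments for C/N/V/Z and M/S are assumed to follow the
    order in which the text lists them.
    """
    return {
        "position": sr & 0x3F,                # bits 0-5: shifter position
        "width": (sr >> 8) & 0x1F,            # bits 8-12: mask width
        "conditions": (sr >> 13) & 0x7,       # bits 13-15: read-only conditions
        "C": (sr >> 16) & 1,                  # carry (assumed bit 16)
        "N": (sr >> 17) & 1,                  # negative (assumed bit 17)
        "V": (sr >> 18) & 1,                  # overflow (assumed bit 18)
        "Z": (sr >> 19) & 1,                  # zero (assumed bit 19)
        "L": (sr >> 20) & 1,                  # link, bit 20
        "M": (sr >> 21) & 1,                  # multiply/divide flag (assumed bit 21)
        "S": (sr >> 22) & 1,                  # partial remainder sign (assumed bit 22)
        "nibble_carries": (sr >> 24) & 0xFF,  # bits 24-31: BCD nibble carries
    }
```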


TABLE 2. 


Highest Priority Active Bit        Encoder Output 


Byte Width = 00 (32-bit) 
    None 
    31 
    30 
    29 
    28 

Byte Width = 01 (8-bit) 
    None 

Byte Width = 10 (16-bit) 
    None 
    15 
    14 
    13 
    12 

Byte Width = 11 (24-bit) 
    None 
    23 
    22 
    21 
    20 
Figure 1. ALU Status Register Bit Assignment 


Am29332 INSTRUCTION SET 
Data Types 
The Am29332 supports the following data types: 


1. Integer 
2. Binary coded decimal 
3. Variable-length bit field 


The first two data types fall into the category of byte boundary 
aligned operands (Figure 2). The size of the operand can be 
1 byte, 2 bytes, 3 bytes or 4 bytes. All operands are least 
significant bit (bit 0) aligned. The byte width is determined by 
bits I8 and I7 of the instruction as shown in Table 3. 


TABLE 3. 


I8   I7   Width in Bytes 
0    0    4 
0    1    1 
1    0    2 
1    1    3 


The third data type has operands of variable width (1 to 32 
bits) as shown in Figure 2. The operand is specified by width 
inputs (W4-W0) and position inputs (P5-P0). The position 
inputs indicate the least significant bit position of the operand. 
Depending on bits I8 and I7 of the instruction, the width and 
position inputs can be selected from either the Status Register 
or the Width and Position Pins as shown in Table 4. A 
summary of the data types available is illustrated in Table 5. 


TABLE 4. 


I. Byte Boundary Aligned Operands 

II. Variable-Length Bit Field 

    [Bit labels: the field occupies bits P+W-1 down to P of the word, 
    above bits P-1 down to 0; within the field, bits are numbered 
    W-1 down to 0] 

    P = Bit displacement of the least significant field bit with 
        respect to bit 0. 

    W = Width of field in bits. 


Figure 2. 


TABLE 5. 


Data Type   Size 

Integer     1 byte 
            2 bytes 
            3 bytes 
            4 bytes         -2^31 to 2^31-1 

BCD                         Numeric, 2 digits per byte. 
                            Most-significant digit may be 
                            used for sign. 

Variable    1 to 32 bits    Dependent on position and 
                            width inputs. 





INSTRUCTION FORMAT 
The Am29332 has two types of Instruction Formats: 

1. Byte Boundary Aligned Instructions 

   [Format diagram: bits I8-I7 and bits I6-I0] 

2. Variable-Length Field Bit Instructions 

   [Format diagram: bits I8 to I0] 


For instructions which allow a field to be shifted up or down, 
P5-P0 is a two's complement number in the range -32 to 
+31 representing the direction and magnitude of the shift. For 
instructions which assume a fixed field position, P4-P0 
represent the position of the least-significant bit of the field 
and P5 is ignored. 
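
To make the two interpretations of the position inputs concrete, here is a minimal Python sketch. The helper names are hypothetical; the arithmetic is simply standard 6-bit two's complement decoding and a 5-bit mask, as the paragraph above describes.

```python
def signed_shift(p):
    """Interpret P5-P0 as a 6-bit two's complement shift amount (-32..+31)."""
    p &= 0x3F
    return p - 64 if p & 0x20 else p


def field_position(p):
    """Interpret P4-P0 as a fixed field position (0..31); P5 is ignored."""
    return p & 0x1F
```

For instance, the pattern 111111 decodes to a shift of -1, while the same pattern used as a fixed position selects bit 31.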


Instruction Classification 
ALU instructions can be classified as follows: 


A. Byte Boundary Aligned Operand Instructions: 
   1. Arithmetic 
      - Binary, BCD 
      - Multiply steps 
      - Division steps (single and multiple precision) 
   2. Prioritize 
   3. Logical 
   4. Single-bit shifts 
   5. Data movement 

B. Variable-Length Bit Field Operand Instructions: 
   1. N-bit shifts and rotates 
   2. Bit manipulations 
   3. Field logical operations (aligned, non-aligned, extract) 
   4. Mask generation 


Three-fourths of the ALU instructions apply to operands that 
are byte boundary aligned. For these instructions, two 
orthogonal issues are the width of the operand (in bytes) and the 
contents of the high order unselected bytes on the Y bus. As 
mentioned earlier, the width of the operand is specified by I8 
and I7. With the exception of a few instructions, the unselected 
bytes are assigned values as follows: for single operand 
instructions, unselected bytes are passed unchanged from the 
source (A or B). For two operand instructions, unselected 
bytes are passed unchanged from the destination (B input). 


In the last quarter of the instruction set, the width of the 
operand is from 1 to 32 bits (based on the width input) for field 
operations, 32 bits for N-bit shift operations and 1 bit for 
bit-oriented operations. In the case of field-aligned and single-bit 
operands, the position bits (P4-P0) determine the least 
significant bit of the operand. In the case of N-bit shifts and 
field non-aligned operands, the position bits P5-P0 form a 6-bit 
signed integer determining the magnitude and direction of the 
shift. 


The operation of each instruction can be explained by the use 
of a collection of handy functions. The most common of these 
describes a fundamental property of the ALU: 


Merge (X, Y, Mask) 


Here the selected bits (determined by the Mask) pass X while 
the unselected bits pass Y. 


Most single byte boundary aligned operand instructions of the 
ALU can be explained by: 


Y ← Merge (f(operand), operand, bytemask) 


where bytemask itself is a function of byte width and can be 
denoted by bytemask = mask (byte width). The function 
bytemask returns a mask consisting of ones in the least 
significant bytes (selected by byte width) and zeros in the 
remaining bytes. In the above operation, the result of the 
function is returned in the least significant bytes selected by 
the mask and the operand in the high order unselected bytes. 


Similarly, two-operand instructions can be explained by: 

Y ← Merge (f(A, B), B, bytemask) 


The only difference is that here the operation is done on two 
operands and that the unselected high order bytes always 
pass the B operand. 
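
The Merge and bytemask functions can be sketched in a few lines of Python. This is an illustrative model only; the function names follow the text, the byte-width codes (00 = 4 bytes, 01 = 1, 10 = 2, 11 = 3) are those used for the priority encoder and sign-bit descriptions, and `add_bytes` is a hypothetical example of a two-operand instruction.

```python
WORD = 0xFFFFFFFF

# Byte-width code (I8, I7) -> number of selected bytes.
BYTES_FOR_CODE = {0b00: 4, 0b01: 1, 0b10: 2, 0b11: 3}


def bytemask(code):
    """Ones in the selected least-significant bytes, zeros elsewhere."""
    return (1 << (8 * BYTES_FOR_CODE[code])) - 1


def merge(x, y, mask):
    """Selected bits (mask = 1) pass x; unselected bits pass y."""
    return ((x & mask) | (y & ~mask)) & WORD


def add_bytes(a, b, code):
    """Two-operand example: Y <- Merge(A + B, B, bytemask)."""
    return merge((a + b) & WORD, b, bytemask(code))
```

With a one-byte width, adding A = 0x000000FF to B = 0x12345601 replaces only the low byte of B with the low byte of the sum; the upper three bytes of B pass through unchanged.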


The shift operation on byte boundary aligned operands can be 
explained by: 


Up/Down Shift (operand, fill-bit, byte-width) 


where byte-width determines the number of bytes to be shifted 
and fill-bit is the bit shifted in. 


The variable bit field operations can also be explained in the 
same manner: 


Y ← Merge (f(A, B), B, bitmask (position, width)) 


The mask in this case is a bitmask and is a function of position 
and width. Position determines the position of the least 
significant bit of the selected field, and width determines the 
number of higher order bits selected. Mask bits are HIGH for 
selected bits in the word and LOW for the remaining bits. The 
function is applied only to the selected bits; the unselected 
bits pass the source operand for single operand instructions 
and operand B for two operand instructions. 
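
A minimal Python sketch of the bitmask function and a field operation follows. It takes the width as a plain integer from 1 to 32 and does not model how the W4-W0 pins encode that value; `field_or` is a hypothetical example of a two-operand field instruction.

```python
WORD = 0xFFFFFFFF


def bitmask(position, width):
    """Ones in `width` bits starting at bit `position`, truncated to 32 bits."""
    return (((1 << width) - 1) << position) & WORD


def merge(x, y, mask):
    """Selected bits (mask = 1) pass x; unselected bits pass y."""
    return ((x & mask) | (y & ~mask)) & WORD


def field_or(a, b, position, width):
    """Y <- Merge(A OR B, B, bitmask(position, width))."""
    return merge(a | b, b, bitmask(position, width))
```

For example, an 8-bit field at position 4 gives the mask 0x00000FF0, so an OR of A = 0xFFFFFFFF into B = 0x0000000F changes only bits 4 to 11 of B.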


Flags 
Byte-Aligned Instructions: 


The zero flag always looks only at the selected bytes: 

Z ← (Y and bytemask (byte width) = 0) 


Similarly, N ← sign-bit (Y, byte width), where the function 
"sign-bit" returns bit 7, 15, 23, or 31 of the first argument for 
byte widths 01, 10, 11, or 00 respectively. 


Also, C ← carry (byte width) returns the carry from the 
appropriate byte boundary, and: 


V ← overflow (byte width) 

returns the overflow from the appropriate byte boundary. 


The link (L) flag is generally loaded with the bit moved out of 
the highest selected byte in the case of upshifts, or the bit 
moved out of the least significant byte for downshifts. Other 
status flags have specialized uses, explained in the following 
sections. 
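
As a worked illustration of the byte-aligned flag definitions, the Python sketch below computes Z, N, C and V for an addition at a given byte width. It is a behavioral model under stated assumptions: the function name is hypothetical, and the overflow rule used (operands share a sign that the result does not) is the conventional two's complement definition rather than something the text spells out.

```python
BYTES_FOR_CODE = {0b00: 4, 0b01: 1, 0b10: 2, 0b11: 3}


def add_flags(a, b, code):
    """Flags for A + B evaluated on the bytes selected by the width code."""
    bits = 8 * BYTES_FOR_CODE[code]
    mask = (1 << bits) - 1
    total = (a & mask) + (b & mask)
    y = total & mask
    sign = lambda v: (v >> (bits - 1)) & 1
    return {
        "Z": int(y == 0),                      # selected bytes all zero
        "N": sign(y),                          # bit 7, 15, 23 or 31
        "C": (total >> bits) & 1,              # carry out of the byte boundary
        "V": int(sign(a) == sign(b) != sign(y)),  # assumed signed-overflow rule
    }
```

At a one-byte width, 0xFF + 0x01 yields Z = 1 and C = 1, while 0x7F + 0x01 yields N = 1 and V = 1.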





Variable-Length Field Instructions: 


Generally, only N and Z are affected. N takes the 
most-significant bit of the 32-bit result (i.e., N = Y31). Z detects 
zeros in the selected field of the result (i.e., Z ← (Y and 
bitmask (position, width) = 0)). 


Output Select 


The Register Status pin may be used to switch the C, Z, N, V, 
and L output pins between the direct output of the ALU and 
the outputs of the corresponding bits in the status register. If 
the direct status output is selected, then for instructions that 
do not affect a particular flag (e.g., carry for logical operations) 
that output will reflect the state of its corresponding bit in the 
status register. Similarly, when the HOLD signal is made 
HIGH, the C, Z, N, V and L pins will be made equal to the 
contents of the status register, regardless of the RS input. 

INSTRUCTION SET SUMMARY 


Operand Size: Variable Byte Width: 1, 2, 3, 4 Bytes 

Group               Instructions                                  Data Type 

Arithmetic          • Increment by one, two, four                 Binary Integer 
                    • Decrement by one, two, four                 and BCD 
                    • Add, addc (carry = macro/micro) 
                    • Sub, subr 
                    • Subc, subrc (carry/borrow) 
                    • Negate (two's complement) 
                    • BCD sum and difference correct steps 
                    • Multiply steps (modified Booth)             Binary Integer 
                      (Signed and unsigned) 
                    • Divide steps (non-restoring) 
                      (Single and double precision) 

Single-Bit Shifts   • Upshift with 0, 1, link fill 
                    • Downshift with 0, 1, link, sign fill 

Data Movement       • Zero extend                                 Binary 
                    • Sign extend 
                    • Pass-status, Q-Reg 
                    • Load-status, Q-Reg 
                    • Merge 


Operand Size: 32 Bits 

Group               Instructions                                  Data Type 

N-Bit Shifts        • Upshift by 0 to 31 bits with 0 fill         Binary 
and Rotates         • Downshift by 1 to 32 bits with 0, sign fill 
                    • Rotate by 0 to 31 bits 


Operand Size: Single Bit 

Group               Instructions                                  Data Type 

Bit Manipulation    • Extract                                     Binary 
                    • Set 
                    • Reset 


Operand Size: Variable Length Bitfield: 1 to 32 Bits 

Group               Instructions                                  Data Type 

Field Logical       • Not, OR, XOR, AND, extract, insert          Binary 
(aligned and 
non-aligned) 

Mask                • Pass mask                                   Binary 
Am29334 


Four-Port, Dual-Access Register File 


ADVANCED INFORMATION 


DISTINCTIVE CHARACTERISTICS 


• Fast 
  With an access time of 20ns, the Am29334 supports 
  80-90ns microcycle time when used with the 
  Am29300 Family for 32-bit systems. 
• 64 x 18 Bits Wide Register File 
  The Am29334 is a high-performance, high-speed, 
  dual-access RAM with two READ ports and two 
  WRITE ports. 
• Cascadable 
  The Am29334 is cascadable to support either wider 
  word widths, deeper register files, or both. 
• Simplified Timing Control 
  Control for write enable timing and for the on-chip 
  read/write multiplexer is derived from a 
  single-phase clock input. 
• Byte Parity Storage 
  Width of 18 bits facilitates byte parity storage for 
  each port and provides consistency with the 
  Am29332 32-bit ALU. 
• Byte Write Capability 
  Individual byte-write enables allow byte or full word 
  write. 


GENERAL DESCRIPTION 


The Am29334 is a 64-word deep and 18-bit wide dual-access 
register file designed to support other members of the 
Am29300 Family by providing high-speed storage. It has two 
write and two read ports for data and four 6-bit address ports. 
Two address ports are associated with each pair of read and 
write data ports, one to read data and the other to write. The 
device is capable of performing two reads and two writes in 
one cycle. The 18-bit wide register file allows storage of byte 
parity to support parity check and generate in the Am29332 
32-bit ALU. Independent control for each read and write data 
port allows the Am29334 to be used as a high-speed shared 
memory or as a mailbox for a multiprocessor system. The 
device is designed with an access time of 20ns. It is housed in 
a 120-lead pin-grid-array package. 


BLOCK DIAGRAM 


[Block diagram: 64 x 18 dual-access RAM with 18-bit data paths, 
latch enable (LE) and output enable (OE) inputs] 





This document contains information on a product under development at Advanced Micro Devices, Inc. The information is intended to 
help you to evaluate this product. AMD reserves the right to change or discontinue work on this proposed product without notice. 

Order #05731B 


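RELATED PRODUCTS 


Part No.    Description 
Am29323     32 x 32 Parallel Multiplier 
Am29325     32-Bit Floating Point Processor 
Am29331     16-Bit Microinterruptible Sequencer 
Am29332     32-Bit ALU 
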

[Block diagram showing: Am29334 register file; Am29331 16-bit 
sequencer; microprogram memory; pipeline register; control 
signals; Am29325 32-bit floating point processor; Am29332 
32-bit ALU; Am29323 32 x 32 parallel multiplier] 


Figure 1. Am29300 Family High Performance System Block Diagram 


PIN DESCRIPTION 


Name      I/O                  Description 

ARA0-5    Input                Read address for YA 
ARB0-5    Input                Read address for YB 
YA0-17    Three-State Output   Data output A 
YB0-17    Three-State Output   Data output B 
AWA0-5    Input                Write address for DA 
AWB0-5    Input                Write address for DB 
DA0-17    Input                Data input A 
DB0-17    Input                Data input B 
LEA       Input                Latch enable A 
LEB       Input                Latch enable B 
OEA       Input                Output enable for YA 
OEB       Input                Output enable for YB 
WEAC      Input                Common write enable A 
WEAL      Input                Low byte write enable A (bits 0-8) 
WEAH      Input                High byte write enable A (bits 9-17) 
WEBC      Input                Common write enable B 
WEBL      Input                Low byte write enable B (bits 0-8) 
WEBH      Input                High byte write enable B (bits 9-17) 

14 power pins. 




FUNCTIONAL DESCRIPTION 


The part has two read ports (YA0-17, YB0-17), two write 
ports (DA0-17, DB0-17), four addresses (ARA0-5, 
AWA0-5, ARB0-5, AWB0-5), two latch enables (LEA, LEB), 
two output enables (OEA, OEB), and six write enables (WEAC, 
WEAL, WEAH, WEBC, WEBL, WEBH) that allow writing of data 
into one or both bytes of a word. The separate read and write 
addresses facilitate creation of three- and four-address 
architectures and allow address set-up and RAM access to 
overlap. 


Since the A and B sides are identical, only operation of the A 
side is described. The address multiplexer provides the RAM 
with the address ARA when WEAC = HIGH and with the 
address AWA when WEAC = LOW. Internally the part is 
designed so that there is no race condition between the write 
address and the write enable. In most cases WEAC and LEA 
will be connected to the clock as shown in Figure 2 so that 
reading will take place in the first part of a clock cycle and 
writing in the last part. The latch at the output of the RAM is 
transparent when LEA = HIGH and retains the data when 
LEA = LOW. The latch has a three-state output YA controlled 
by OEA. Each word is split into two bytes of nine bits that can 
be individually written. The low byte covers bits 0 through 8 
and the high byte covers bits 9 through 17. One or both bytes 
of the data at DA are written into the location given by AWA 
when the common write enable (WEAC) and the appropriate 
byte write enables (WEAL and WEAH) are active. 


[Figure 2 waveforms: CP, WEAC, LEA; read and write address 
selection; WEAH, WEAL; read data; write data] 


Two special cases arise. First, if a location is written into and 
read at the same time, the value read is the value being 
written. Second, if a location is written into from both the A 
side and the B side, the value written is undefined, but the 
operation is not harmful. 
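
The byte-write behavior and the first special case can be captured in a small behavioral sketch. This Python model covers one side of the device only and is not a timing-accurate description: the function name and boolean enable arguments are hypothetical, and it assumes that on a simultaneous read and write of the same location the output sees the word as it stands after the enabled bytes are written.

```python
LOW_BYTE = 0x001FF    # bits 0-8
HIGH_BYTE = 0x3FE00   # bits 9-17
WORD18 = 0x3FFFF


def cycle(mem, read_addr, write_addr, data, write_low, write_high):
    """One cycle on one side: write the enabled bytes, then return the read word."""
    mask = (LOW_BYTE if write_low else 0) | (HIGH_BYTE if write_high else 0)
    mem[write_addr] = (data & mask) | (mem[write_addr] & ~mask & WORD18)
    # Reading the location being written returns the value being written.
    return mem[read_addr]
```

A 64-entry list of integers serves as the memory; writing only the low byte of location 5 while reading location 5 returns the newly written low byte with the high byte unchanged.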


Extension To Four Read Ports and Two Write 
Ports 


A RAM with four read ports and two write ports can be made 
by using two dual access RAMs and connecting each of the 
write ports, write addresses, and write enables in parallel for 
the two devices. As an example, this RAM may provide data 
storage for a data ALU and an address adder as shown in 
Figure 3. A location should not be read before it has been 
written into for the first time as the contents of the two dual 
access RAMs are likely to be different upon power-up. 


32 Words x 36 Bits Single Access RAM 


It is possible to convert the 64 words x 18 bits dual access 
RAM into a 32 word x 36 bit single access RAM by storing the 
upper half of the 36 bits in the upper half of the 64 words, 
addressed from the A side, and storing the lower half of the 
36 bits in the lower half of the 64 words, addressed 
from the B side. This arrangement, which is shown in Figure 4, 
does not change the capacity of the RAM, but the dual access 
is lost. 
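
The address mapping for this arrangement can be sketched as follows. The Python model below is only an illustration under the stated scheme (upper 18 bits in words 32 to 63 via the A side, lower 18 bits in words 0 to 31 via the B side); the function names are hypothetical and a plain list stands in for the 64 x 18 RAM.

```python
WORD18 = 0x3FFFF


def write36(mem, addr, word36):
    """Store a 36-bit word at logical address 0..31."""
    mem[32 + addr] = (word36 >> 18) & WORD18  # upper half, A side
    mem[addr] = word36 & WORD18               # lower half, B side


def read36(mem, addr):
    """Reassemble the 36-bit word from its two 18-bit halves."""
    return (mem[32 + addr] << 18) | mem[addr]
```

Each 36-bit access uses both sides of the device at once, which is why the dual access is lost.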


Figure 2. Read through YA and Write through DA in a Single Cycle (Two Bytes) 
[Diagram: two dual-access RAMs sharing the DA and DB inputs, 
each with its own Y outputs] 

Figure 3. RAM with 4 Read Ports and 2 Write Ports 


[Diagram: inputs D18-D35 and D0-D17; outputs Y18-Y35 and Y0-Y17] 


Figure 4. 32 x 36 RAM (Single Access) Using 64 x 18 Dual Access RAM 